Analysing C# code on GitHub with BigQuery

Just over a year ago, Google made all the open source code on GitHub available for querying within BigQuery and, as if that wasn’t enough, you can run a terabyte of queries each month for free!

So in this post I am going to look at all the C# source code on GitHub and see what we can find out from it. Handily, a smaller, C#-only dataset has been made available (in BigQuery you are charged per byte read). It’s called fh-bigquery:github_extracts.contents_net_cs and has:

- 5,885,933 unique ‘.cs’ files
- 792,166,632 lines of code (LOC)
- 37.17 GB of data

Which is a pretty comprehensive set of C# source code!

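The rest of this post will attempt to answer the following questions:

- Tabs or Spaces?
- regions: ‘should be banned’ or ‘okay in some cases’?
- ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
- Do C# developers like writing functional code?

Then moving onto some less controversial C# topics:

- Which using statements are most widely used?
- What NuGet packages are most often included in a .NET project?
- How many lines of code (LOC) are in a typical C# file?
- What is the most widely thrown Exception?
- ‘async/await all the things’ or not?
- Do C# developers like using the var keyword?

Before we end up looking at repositories, not just individual C# files:

- Just how many files should you have in a repository?
- What is the most popular repository with C# code in it?
- What are the most popular C# class names?
- ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?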

If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss some edge cases; after all, as Regular Expressions: Now You Have Two Problems puts it:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Tabs or Spaces?

In the entire dataset there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space:

| Tabs | Tabs % | Spaces | Spaces % | Total |
|---|---|---|---|---|
| 799,055 | 17.15% | 3,859,528 | 82.85% | 4,658,583 |
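If you want to try the same kind of check on a local file rather than in BigQuery, here is a minimal Python sketch. The function name and the "whichever style indents more lines wins" rule are assumptions for illustration; only the "more than 10 indented lines" threshold comes from the text above, and the actual query may classify files differently.

```python
def classify_indentation(source: str, min_lines: int = 10):
    """Return 'tabs', 'spaces', or None when too few lines are indented."""
    tab_lines = 0
    space_lines = 0
    for line in source.splitlines():
        if line.startswith("\t"):
            tab_lines += 1
        elif line.startswith(" "):
            space_lines += 1
    # Files with 10 or fewer indented lines are left out of the comparison.
    if tab_lines + space_lines <= min_lines:
        return None
    # Assumption: the file counts for whichever style indents more lines
    # (ties fall to 'spaces').
    return "tabs" if tab_lines > space_lines else "spaces"
```

Mixed files are the awkward case: a file that uses both styles still lands in exactly one bucket here.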

Clearly, C# developers (on GitHub) prefer spaces over tabs. Let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default.)

If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?.

regions: ‘should be banned’ or ‘okay in some cases’?

It turns out that an impressive 712,498 C# files (out of 5.9 million) contain at least one #region statement (query used); that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)

‘K&R’ or ‘Allman’, where do C# devs like to put their braces?

C# developers overwhelmingly prefer putting an opening brace { on its own line (query used):

| separate line | same line | same line (initializer) | total (with brace) | total (all code) |
|---|---|---|---|---|
| 81,306,320 (67%) | 40,044,603 (33%) | 3,631,947 (2.99%) | 121,350,923 (15.32%) | 792,166,632 |

(‘same line initializers’ include code like new { Name = "", .. }, new [] { 1, 2, 3.. })
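To make the three buckets concrete, here is a rough Python sketch of a line-by-line classifier. The regex and the bucket rules are assumptions for illustration, not the query used for the table; in particular it treats ‘same line (initializer)’ as a subset of ‘same line’, which is how the totals in the table appear to add up.

```python
import re
from collections import Counter

# Assumed heuristic: 'new', then anything except '{' or ';', then '{'
# e.g. new { Name = "" } or new [] { 1, 2, 3 }
INITIALIZER = re.compile(r"\bnew\b[^{;]*\{")

def classify_brace_lines(source: str) -> Counter:
    """Count opening-brace placement for every line that contains '{'."""
    counts = Counter()
    for line in source.splitlines():
        stripped = line.strip()
        if "{" not in stripped:
            continue
        if stripped == "{":
            counts["separate line"] += 1
        else:
            counts["same line"] += 1
            if INITIALIZER.search(stripped):
                counts["same line (initializer)"] += 1
    return counts
```

A line-based check like this will be fooled by braces inside strings and comments, which is one of the edge cases a regex approach is likely to miss.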

Do C# developers like writing functional code?

This is slightly unscientific, but I wanted to see how widely the Lambda Operator => is used in C# code (query). Yes, I know, if you want to write functional code on .NET you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.

Here are the raw percentiles:

| Percentile | % of lines using lambdas |
|---|---|
| 10 | 0.51 |
| 25 | 1.14 |
| 50 | 2.50 |
| 75 | 5.26 |
| 90 | 9.95 |
| 95 | 14.29 |
| 99 | 28.00 |

So we can say that:

- 50% of all C# files on GitHub use => on 2.50% (or less) of their lines
- 10% of all C# files have lambdas on almost 1 in 10 of their lines (9.95%)
- 5% use => on 1 in 7 lines (14.29%)
- 1% of files have lambdas on over 1 in 4 of their lines of code (28%), that’s pretty impressive!
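The per-file figure behind those percentiles can be sketched in a few lines of Python. This is an assumed reconstruction of the measurement (lines containing => divided by total lines), not the original query, and the function name is made up for the example.

```python
def lambda_line_percentage(source: str) -> float:
    """Percentage of lines in one file that contain the => operator."""
    lines = source.splitlines()
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if "=>" in line)
    return 100.0 * hits / len(lines)
```

Note that a plain substring check also counts expression-bodied members and any => inside strings or comments, so it slightly overstates ‘functional’ usage.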

Which using statements are most widely used?

Now on to something a bit more substantial: what are the most widely used using statements in C# code?

The top 10 looks like this (the full results are available):

| using statement | count |
|---|---|
| using System.Collections.Generic; | 1,780,646 |
| using System; | 1,477,019 |
| using System.Linq; | 1,319,830 |
| using System.Text; | 902,165 |
| using System.Threading.Tasks; | 628,195 |
| using System.Runtime.InteropServices; | 431,867 |
| using System.IO; | 407,848 |
| using System.Runtime.CompilerServices; | 338,686 |
| using System.Collections; | 289,867 |
| using System.Reflection; | 218,369 |

However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’, which are included in ‘AssemblyInfo.cs’ by default.

So if we adjust the list to take account of this, the top 10 looks like so:

| using statement | count |
|---|---|
| using System.IO; | 407,848 |
| using System.Collections; | 289,867 |
| using System.Reflection; | 218,369 |
| using System.Diagnostics; | 201,341 |
| using System.Threading; | 179,168 |
| using System.ComponentModel; | 160,681 |
| using System.Web; | 160,323 |
| using System.Windows.Forms; | 137,003 |
| using System.Globalization; | 132,113 |
| using System.Drawing; | 127,033 |

Finally, an interesting list is the top 10 using statements that aren’t System, Microsoft or Windows namespaces:

| using statement | count |
|---|---|
| using NUnit.Framework; | 119,463 |
| using UnityEngine; | 117,673 |
| using Xunit; | 99,099 |
| using Newtonsoft.Json; | 81,675 |
| using Newtonsoft.Json.Linq; | 29,416 |
| using Moq; | 23,546 |
| using UnityEngine.UI; | 20,355 |
| using UnityEditor; | 19,937 |
| using Amazon.Runtime; | 18,941 |
| using log4net; | 17,297 |
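For anyone who wants to reproduce this kind of count on their own code, here is a small Python sketch. The regex, the function names and the choice to count each namespace once per file are all assumptions for illustration; the real query may well count differently.

```python
import re
from collections import Counter

# Matches plain directives such as 'using System.Linq;'. It deliberately
# skips 'using static ...', aliases ('using X = Y;') and using (...) blocks.
USING = re.compile(r"^\s*using\s+([A-Za-z_][\w.]*)\s*;", re.MULTILINE)

def count_usings(files):
    """Count, per namespace, how many files import it."""
    counts = Counter()
    for source in files:
        counts.update(set(USING.findall(source)))
    return counts

def third_party(counts):
    """Drop namespaces rooted at System, Microsoft or Windows."""
    excluded = ("System", "Microsoft", "Windows")
    return Counter({ns: n for ns, n in counts.items()
                    if ns.split(".")[0] not in excluded})
```

Calling third_party(count_usings(files)).most_common(10) would give a list in the same shape as the last table.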

What NuGet packages are most often included in a .NET project?

It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub. It’s called contents_net_packages_config and has 104,808 entries. By querying this we can see that Json.Net is the clear winner!!

| package | count |
|---|---|
| Newtonsoft.Json | 45,055 |
| Microsoft.Web.Infrastructure | 16,022 |
| Microsoft.AspNet.Razor | 15,109 |
| Microsoft.AspNet.WebPages | 14,495 |
| Microsoft.AspNet.Mvc | 14,236 |
| EntityFramework | 14,191 |
| Microsoft.AspNet.WebApi.Client | 13,480 |
| Microsoft.AspNet.WebApi.Core | 12,210 |
| Microsoft.Net.Http | 11,625 |
| jQuery | 10,646 |
| Microsoft.Bcl.Build | 10,641 |
| Microsoft.Bcl | 10,349 |
| NUnit | 10,341 |
| Owin | 9,681 |
| Microsoft.Owin | 9,202 |
| Microsoft.AspNet.WebApi.WebHost | 9,007 |
| WebGrease | 8,743 |
| Microsoft.AspNet.Web.Optimization | 8,721 |
| Microsoft.AspNet.WebApi | 8,179 |
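A ‘packages.config’ file is just XML with one package element per dependency, so the same tally can be sketched locally with Python’s standard library. This is an illustrative sketch rather than the query behind the table; the function name is made up, and counting each package once per file is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_packages(config_files):
    """Count how many packages.config files reference each package id."""
    counts = Counter()
    for xml_text in config_files:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        ids = {pkg.get("id") for pkg in root.iter("package") if pkg.get("id")}
        counts.update(ids)
    return counts
```

Parsing the XML properly sidesteps the regex edge cases mentioned earlier, at the cost of silently dropping malformed files.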

How many lines of code (LOC) are in a typical C# file?

Are C# developers prone to creating huge files that go on for thousands of lines? Well, some are, but fortunately it’s the minority of us!!

Note that the Y-axis is ‘lines of code’ and is logarithmic; the raw data is available.

Oh dear, Uncle Bob isn’t going to be happy: whilst 96% of the files have 509 LOC or less, the other 4% don’t!! From Clean Code:

And in case you’re wondering, here’s the Top 10 longest C# files!!

| File | Lines |
|---|---|
| MarMot/Input/test.marmot.cs | 92663 |
| src/CodenameGenerator/WordRepos/LastNamesRepository.cs | 88810 |
| cs_inputtest/cs_02_7000.cs | 63004 |
| cs_inputtest/cs_02_6000.cs | 54004 |
| src/ML NET20/Utility/UserName.cs | 52014 |
| MWBS/Dictionary/DefaultWordDictionary.cs | 48912 |
| Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs | 48407 |
| UrduProofReader/UrduLibs/Utils.cs | 48255 |
| cs_inputtest/cs_02_5000.cs | 45004 |
| css/style.cs | 44366 |

What is the most widely thrown Exception?

There are a few interesting results in this query. For instance, who knew that so many ApplicationExceptions were thrown? And NotSupportedException being so high up the list is a bit worrying!!

| Exception | count |
|---|---|
| throw new ArgumentNullException | 699,526 |
| throw new ArgumentException | 361,616 |
| throw new NotImplementedException | 340,361 |
| throw new InvalidOperationException | 260,792 |
| throw new ArgumentOutOfRangeException | 160,640 |
| throw new NotSupportedException | 110,019 |
| throw new HttpResponseException | 74,498 |
| throw new ValidationException | 35,615 |
| throw new ObjectDisposedException | 31,129 |
| throw new ApplicationException | 30,849 |
| throw new UnauthorizedException | 21,133 |
| throw new FormatException | 19,510 |
| throw new SerializationException | 17,884 |
| throw new IOException | 15,779 |
| throw new IndexOutOfRangeException | 14,778 |
| throw new NullReferenceException | 12,372 |
| throw new InvalidDataException | 12,260 |
| throw new ApiException | 11,660 |
| throw new InvalidCastException | 10,510 |
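A tally like this boils down to pulling the type name out of every ‘throw new X’ and counting. Here is a hedged Python sketch of that idea; the regex and function name are assumptions, and it counts every occurrence rather than once per file, which may or may not match the original query.

```python
import re
from collections import Counter

# Captures the exception type after 'throw new', ignoring any namespace
# prefix, e.g. 'throw new System.ArgumentException(' -> ArgumentException
THROW = re.compile(r"\bthrow\s+new\s+(?:[A-Za-z_]\w*\.)*([A-Za-z_]\w*)")

def count_thrown(files):
    """Count every 'throw new <Type>' occurrence across the given sources."""
    counts = Counter()
    for source in files:
        counts.update(THROW.findall(source))
    return counts
```

Re-throws (a bare throw;) and exceptions created in one place and thrown in another are not counted by this pattern.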

‘async/await all the things’ or not?

The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:

    public async Task<int> GetDotNetCountAsync()
    {
        // Suspends GetDotNetCountAsync() to allow the caller (the web server)
        // to accept another request, rather than blocking on this one.
        var html = await _httpClient.DownloadStringAsync("http://dotnetfoundation.org");
        return Regex.Matches(html, ".NET").Count;
    }

But how much is it used? Using the query below:

    SELECT Count(*) count
    FROM [fh-bigquery:github_extracts.contents_net_cs]
    WHERE REGEXP_MATCH(content, r'\sasync\s|\sawait\s')

I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.
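The pattern in that query is easy to try out locally. The Python sketch below applies the same regular expression to a single file’s contents, assuming REGEXP_MATCH behaves like a ‘match anywhere in the string’ search; the function name is made up for the example.

```python
import re

# Same pattern as the query: async or await surrounded by whitespace.
ASYNC_OR_AWAIT = re.compile(r"\sasync\s|\sawait\s")

def uses_async_or_await(content: str) -> bool:
    """True if the file contents match the pattern anywhere."""
    return ASYNC_OR_AWAIT.search(content) is not None
```

It also shows where the pattern is strict: because whitespace is required on both sides, await(Foo()) and a file that starts with async at the very first character are both missed, while identifiers such as asyncResult are correctly ignored.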

Do C# developers like using the var keyword?

More than they use async and await: there are 1,457,154 files that have at least one usage of the var keyword.

Update: thanks to jairbubbles for pointing out that my var regex was wrong and supplying a fixed version! The original, incorrect regex found only 130,590 files, which made it look as though var was used less than async and await.

Just how many files should you have in a repository?

90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.

(again the Y-axis (# files) is logarithmic)
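Figures like ‘90% of repositories have 95 files or less’ are percentiles over the per-repository file counts. Here is a minimal Python sketch using the nearest-rank method; that method is an assumption (BigQuery’s own quantile functions may be approximate or interpolate differently), and the sample numbers in the usage note are made up.

```python
import math

def percentile(values, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]
```

For example, with the made-up file counts [3, 12, 95, 40, 7, 170, 22, 535, 1, 60], percentile(counts, 50) picks the 5th smallest value and percentile(counts, 90) picks the 9th.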

The top 10 largest repositories, by number of C# files are shown below:

| Repository | # Files |
|---|---|
| https://github.com/xen2/mcs | 23389 |
| https://github.com/mater06/LEGOChimaOnlineReloaded | 14241 |
| https://github.com/Microsoft/referencesource | 13051 |
| https://github.com/dotnet/corefx | 10652 |
| https://github.com/apo-j/Projects_Working | 10185 |
| https://github.com/Microsoft/CodeContracts | 9338 |
| https://github.com/drazenzadravec/nequeo | 8060 |
| https://github.com/ClearCanvas/ClearCanvas | 7946 |
| https://github.com/mwilliamson-firefly/aws-sdk-net | 7860 |
| https://github.com/151706061/MacroMedicalSystem | 7765 |

What is the most popular repository with C# code in it?

This time we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):

| repo | stars | files |
|---|---|---|
| https://github.com/grpc/grpc | 11075 | 237 |
| https://github.com/dotnet/coreclr | 8576 | 6503 |
| https://github.com/dotnet/roslyn | 8422 | 6351 |
| https://github.com/facebook/yoga | 8046 | 73 |
| https://github.com/bazelbuild/bazel | 7123 | 132 |
| https://github.com/dotnet/corefx | 7115 | 10652 |
| https://github.com/SeleniumHQ/selenium | 7024 | 512 |
| https://github.com/Microsoft/WinObjC | 6184 | 81 |
| https://github.com/qianlifeng/Wox | 5674 | 207 |
| https://github.com/Wox-launcher/Wox | 5674 | 142 |
| https://github.com/ShareX/ShareX | 5336 | 766 |
| https://github.com/Microsoft/Windows-universal-samples | 5130 | 1501 |
| https://github.com/NancyFx/Nancy | 3701 | 957 |
| https://github.com/chocolatey/choco | 3432 | 248 |
| https://github.com/JamesNK/Newtonsoft.Json | 3340 | 650 |

Interesting that the top spot is a Google repository! (The C# files in it are sample code for using the gRPC library from .NET.)

What are the most popular C# class names?

Assuming that I got the regex correct, the most popular C# class names are the following:

| Class name | Count |
|---|---|
| class C | 182480 |
| class Program | 163462 |
| class Test | 50593 |
| class Settings | 40841 |
| class Resources | 39345 |
| class A | 34687 |
| class App | 28462 |
| class B | 24246 |
| class Startup | 18238 |
| class Foo | 15198 |

Yay for Foo, just sneaking into the Top 10!!

‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

Finally, let’s look at the most common file names. As with the using statements, they are dominated by the default ones used in the Visual Studio templates:

| File | Count |
|---|---|
| AssemblyInfo.cs | 386822 |
| Program.cs | 105280 |
| Resources.Designer.cs | 40881 |
| Settings.Designer.cs | 35392 |
| App.xaml.cs | 21928 |
| Global.asax.cs | 16133 |
| Startup.cs | 14564 |
| HomeController.cs | 13574 |
| RouteConfig.cs | 11278 |
| MainWindow.xaml.cs | 11169 |

Discuss this post on Hacker News and /r/csharp

More Information

As always, if you’ve read this far your present is yet more blog posts to read, enjoy!!

How BigQuery Works (only put in at the end of the blog post)

BigQuery analysis of other Programming Languages