November 22, 2016 Subset Lab

The relationship between super fund performance and fees

Tools: D3.js, Python incl. Pandas, NumPy, SciPy, Matplotlib libraries

Even modest differences in fees can significantly affect the future return of a superannuation fund. Here we look at the relationship between performance and fees over multiple fund types using five years of data for 166 funds. Only funds where performance data was available for all years between 2011 to 2015 were included. Of particular interest is the sensitivity of fund returns for a given range of fees and the reliability of using fees to predict the gross rate of return.
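To make the stakes concrete, here is a hypothetical compounding sketch (the starting balance, gross return and fee levels are invented for illustration, not drawn from the APRA data):

```python
# Hypothetical illustration: the long-run cost of a 1% fee difference
# on a $50,000 balance earning 7% gross per year.
def balance_after(years, start, gross_return, fees):
    net = (1 + gross_return) * (1 - fees) - 1  # net annual return after fees
    return start * (1 + net) ** years

low_fee = balance_after(30, 50_000, 0.07, 0.01)
high_fee = balance_after(30, 50_000, 0.07, 0.02)
print(round(low_fee), round(high_fee), round(low_fee - high_fee))
```

Over 30 years the 1% fee gap compounds into a difference of tens of thousands of dollars.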

There are significant differences in performance across fund types. A useful way to illustrate this is by comparing the cumulative distribution functions (CDFs) of the three largest types.

Each CDF represents the average five year distribution of returns for each fund type. The average five year return was obtained by calculating the geometric mean (GM) over the given period.

\[\text{GM}_{\text{returns}} = \left[\prod_{i=1}^{n}(1 + r_i)\right]^{\frac{1}{n}} - 1 \]

Fees were also compounded to obtain the average rate over five years, this time treating the cash flows as a deduction.

\[\text{GM}_{\text{fees}} = \left[\prod_{i=1}^{n}(1 - r_i)\right]^{\frac{1}{n}} - 1 \]
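Both formulas can be sketched in a few lines of Python; the yearly figures below are made up purely for illustration:

```python
import numpy as np

# Toy yearly figures as decimals -- illustrative only, not the APRA data.
returns = np.array([0.08, -0.02, 0.10, 0.05, 0.06])
fees = np.array([0.012, 0.011, 0.013, 0.010, 0.012])

gm_returns = np.prod(1 + returns) ** (1 / len(returns)) - 1
gm_fees = np.prod(1 - fees) ** (1 / len(fees)) - 1  # negative: fees are a deduction

print(f"GM returns: {gm_returns:.4%}, GM fees: {gm_fees:.4%}")
```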

We can illustrate the impact of high fee funds on returns by comparing the CDFs of all funds and the subset of funds with fees greater than 2%.

The effect on returns is further accentuated by directly comparing low and high fee funds, namely funds with fees less than 1% and greater than 2%.

We can see below that the majority of returns for high fee funds range from 2% to 6% whereas low fee funds generally have five year average returns in the order of 6% to 9%.

Multivariate regression analysis of various fee types and the rate of return was performed using the linear least squares method. Fee types which are not distributed across all fund members were excluded.

\[y = β_{0} + β_{1}\text{fees}_{\text{operating}} + β_{2}\text{fees}_{\text{investment}}\]
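A minimal sketch of fitting this two-predictor model with ordinary least squares, shown here on synthetic data (the coefficients and fee series are invented; the actual analysis below uses scipy's single-variable linregress on the real data):

```python
import numpy as np

# Synthetic data: returns depend negatively on both fee components, plus noise.
rng = np.random.default_rng(0)
op_fees = rng.uniform(0.2, 1.5, 100)   # operating fees (%)
inv_fees = rng.uniform(0.1, 1.0, 100)  # investment fees (%)
y = 7.0 - 0.3 * op_fees - 0.2 * inv_fees + rng.normal(0, 0.1, 100)

# Design matrix with an intercept column; solve for [b0, b1, b2].
X = np.column_stack([np.ones_like(op_fees), op_fees, inv_fees])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```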

A significant correlation is expected, as the rate of return is inclusive of fees. We will later alter the model to use the gross rate of return to produce more meaningful results. An even better method would be to use alpha derived from the CAPM, although that is outside the scope of this analysis.

\[α_{i,t} = r_{i,t} - [r_{f,t} + β_{i}(r_{M,t} - r_{f,t})] \]
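As a sketch, the alpha calculation itself is a one-liner (the rates below are invented for illustration):

```python
def capm_alpha(r_fund, r_free, beta, r_market):
    """Excess return over the CAPM-expected return (all rates as decimals)."""
    return r_fund - (r_free + beta * (r_market - r_free))

# e.g. a fund returning 8% with beta 0.9, risk-free rate 2%, market return 7%
print(capm_alpha(0.08, 0.02, 0.9, 0.07))
```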

Using net returns, we can observe the average five year rate of return against fees.

Unsurprisingly, there is a strong correlation (r = -0.53), with a significant fit of our model (R2 = 0.28, p < 0.001, 95% CI for the slope: -0.32 to -0.19).

Similarly, we can observe average five year rate of return against operating expenses.

This results in a stronger fit (R2 = 0.33, p < 0.001) as investment expenses generally have an inverse relationship with returns, leading to a higher correlation when this feature is excluded.

It would be of more interest to observe the relationship between the gross rate of return and fees as this separates returns and fees into distinct components. Members might tolerate higher fees for superior returns, especially in the form of a performance bonus. It could then be reasonable to expect a positive rather than a negative correlation.

The relationship is not found to be statistically significant (R2 = 0.00, p = 0.50), therefore higher fees may not be a reliable predictor of superior gross returns.

In the same context, it would be of interest to observe the proportion of gross returns that are eroded by the two major fee components. Retail eligible rollover fund returns are eroded by a considerable 42% after fees. This compares to a 14% reduction for regular retail funds, 11% for industry funds, 7% for corporate funds and 6% in the case of public sector funds.

Finally, it is worthwhile to point out some limitations of this analysis. Returns and fees are not the only criteria when considering a super fund; insurance policies and other benefits are usually also factored in. Risk appetite, fees that are not absorbed by all members, and investment options other than the default offering are also not considered here.

How I did the analysis

Data was derived from annual fund-level superannuation statistics provided by APRA. At the time of writing, industry aggregated data is available up until June 2016, although it does not provide the granularity needed at the fund level. As a result, the most recent usable data covers the year ending June 2015, and data for 2011 to 2014 was only obtainable by externally searching the APRA file server for the relevant files, as the direct link is no longer publicly available for these years.

The basic workflow was to perform exploratory analysis in IPython: merging data for all years into a single Pandas dataframe, scoping meaningful results with Matplotlib, exporting production-ready data in tab-separated values (TSV) format and developing visualisations with D3.js.

First we set up our environment.

import math
import sys
import pandas as pd
import numpy as np
from scipy import stats
import random
import matplotlib
import matplotlib.pyplot as plt

We can default to the inline matplotlib viewer with modified style.

%matplotlib inline
plt.style.use('ggplot')

After we prepare the individual CSV files containing fund-level performance data for each year, we can import the data. For instance, 2011 is imported as follows.

y11 = pd.read_csv('2011.csv')

Merge all years into a single dataframe, keeping only funds with a continuous five years of data. We can achieve this by performing an inner join on the ABN, which is unique to each fund.

df = y11.merge(y12, on='Fund ABN', how='inner')
df = df.merge(y13, on='Fund ABN', how='inner')
df = df.merge(y14, on='Fund ABN', how='inner')
df = df.merge(y15, on='Fund ABN', how='inner')

Expand the maximum display to view all columns and rows.

pd.set_option('display.max_columns', 31)
pd.set_option('display.max_rows', 200)

Drop duplicate columns and NaN values.

df = df.drop(['Fund name 2', 'Fund type 2', 'Fund name 3', 'Fund type 3',
              'Fund name 4', 'Fund type 4', 'Fund name 5', 'Fund type 5'], axis=1)
df = df.dropna()

At this stage it's a good idea to make sure all data series produced so far are of the correct data type using df.dtypes. Incorrect data types, in this case numeric fields that should be floating point, can be changed via df.astype().
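For example (the column and values here are hypothetical):

```python
import pandas as pd

# Hypothetical case: a numeric column read in as strings.
df = pd.DataFrame({'2011 Fees': ['1.2', '0.8', '2.1']})
print(df.dtypes)                      # object
df = df.astype({'2011 Fees': float})  # convert to floating point
print(df.dtypes)                      # float64
```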

Next, create new columns with the average rate of return (RoR) and fees.

df['ror avg'] = df[['2011 Rate of return', '2012 Rate of return', '2013 Rate of return',
                    '2014 Rate of return', '2015 Rate of return']].mean(axis=1)
df['fees avg'] = df[['2011 Fees', '2012 Fees', '2013 Fees',
                     '2014 Fees', '2015 Fees']].mean(axis=1)

Identify and remove outliers.

df[df['ror avg'] < 0]
df = df[df['Fund ABN'] != 60532453567]

We can now perform single variable analysis with our basic dataframe.

Explore Pandas histograms of returns and fees.

df.hist(column='ror avg', bins=50)
df.hist(column='fees avg', bins=50)

Histograms in NumPy.

hst, bin_edges = np.histogram(df['ror avg'], bins=50)
plt.bar(bin_edges[:-1], hst, width=0.25)
hst, bin_edges = np.histogram(df['fees avg'], bins=50)
plt.bar(bin_edges[:-1], hst, width=0.15)

Explore probability mass functions (PMFs) for RoR and fees.

hst, bin_edges = np.histogram(df['ror avg'], bins=50, density=True)
plt.bar(bin_edges[:-1], hst, width=0.25)
hst, bin_edges = np.histogram(df['fees avg'], bins=50, density=True)
plt.bar(bin_edges[:-1], hst, width=0.15)

We can construct normalised CDFs for RoR and fees, grouped by the three major fund types. The data is then exported in TSV format for later D3 production.

hst, bin_edges, cdf, scale, ncdf = ([0, 1, 2] for x in range(5))
hst[0], bin_edges[0] = np.histogram(df[df['Fund type 1'] == 'Retail']['ror avg'], bins=8, density=True)
hst[1], bin_edges[1] = np.histogram(df[df['Fund type 1'] == 'Industry']['ror avg'], bins=8, density=True)
hst[2], bin_edges[2] = np.histogram(df[df['Fund type 1'] == 'Corporate']['ror avg'], bins=8, density=True)

for x in range(3):
    cdf[x] = np.cumsum(hst[x])
    scale[x] = 1.0 / cdf[x][-1]
    ncdf[x] = scale[x] * cdf[x]

plt.plot(bin_edges[0][:-1], ncdf[0], linewidth=3, color='#08519c', label='Retail')
plt.plot(bin_edges[1][:-1], ncdf[1], linewidth=3, color='#4292c6', label='Industry')
plt.plot(bin_edges[2][:-1], ncdf[2], linewidth=3, color='#9ecae1', label='Corporate')
plt.xlabel('% Average annual return')
plt.ylabel('Fraction of super funds')
plt.legend(loc='upper left')

np.savetxt('tsv4.tsv', ncdf[0], delimiter='\t')
np.savetxt('tsv5.tsv', bin_edges[0][:-1], delimiter='\t')
np.savetxt('tsv6.tsv', ncdf[1], delimiter='\t')
np.savetxt('tsv7.tsv', bin_edges[1][:-1], delimiter='\t')
np.savetxt('tsv8.tsv', ncdf[2], delimiter='\t')
np.savetxt('tsv9.tsv', bin_edges[2][:-1], delimiter='\t')

This time we group by all funds and only funds with fees greater than 2%, then export data.

hst, bin_edges, cdf, scale, ncdf = ([0, 1] for x in range(5))
hst[0], bin_edges[0] = np.histogram(df['ror avg'], bins=8, density=True)
hst[1], bin_edges[1] = np.histogram(df[df['fees avg'] > 2]['ror avg'], bins=8, density=True)

for x in range(2):
    cdf[x] = np.cumsum(hst[x])
    scale[x] = 1.0 / cdf[x][-1]
    ncdf[x] = scale[x] * cdf[x]

plt.plot(bin_edges[0][:-1], ncdf[0], linewidth=3, color='#08519c', label='All funds')
plt.plot(bin_edges[1][:-1], ncdf[1], linewidth=3, color='#4292c6', label='Fees greater than 2%')
plt.xlabel('% Average annual return')
plt.ylabel('Fraction of super funds')
plt.legend(loc='upper left')

np.savetxt('tsv10.tsv', ncdf[0], delimiter='\t')
np.savetxt('tsv11.tsv', bin_edges[0][:-1], delimiter='\t')
np.savetxt('tsv12.tsv', ncdf[1], delimiter='\t')
np.savetxt('tsv13.tsv', bin_edges[1][:-1], delimiter='\t')

Finally, for our last CDFs we group by fees less than 1% and greater than 2%.

hst, bin_edges, cdf, scale, ncdf = ([0, 1] for x in range(5))
hst[0], bin_edges[0] = np.histogram(df[df['fees avg'] > 2]['ror avg'], bins=8, density=True)
hst[1], bin_edges[1] = np.histogram(df[df['fees avg'] < 1]['ror avg'], bins=8, density=True)

for x in range(2):
    cdf[x] = np.cumsum(hst[x])
    scale[x] = 1.0 / cdf[x][-1]
    ncdf[x] = scale[x] * cdf[x]

plt.plot(bin_edges[0][:-1], ncdf[0], linewidth=3, color='#08519c', label='Fees greater than 2%')
plt.plot(bin_edges[1][:-1], ncdf[1], linewidth=3, color='#4292c6', label='Fees less than 1%')
plt.xlabel('% Average annual return')
plt.ylabel('Fraction of super funds')
plt.legend(loc='upper left')

np.savetxt('tsv14.tsv', ncdf[0], delimiter='\t')
np.savetxt('tsv15.tsv', bin_edges[0][:-1], delimiter='\t')
np.savetxt('tsv16.tsv', ncdf[1], delimiter='\t')
np.savetxt('tsv17.tsv', bin_edges[1][:-1], delimiter='\t')

We can move on to preparing the production data for our scatterplots, but first we need to create some necessary columns that are missing.

Create columns for compound (geometric) average for RoR and fees, after moving the simple averages to new columns.

df['ror savg'] = df['ror avg']
df['ror avg'] = (((df['2011 Rate of return'] / 100) + 1) *
                 ((df['2012 Rate of return'] / 100) + 1) *
                 ((df['2013 Rate of return'] / 100) + 1) *
                 ((df['2014 Rate of return'] / 100) + 1) *
                 ((df['2015 Rate of return'] / 100) + 1))
df['ror avg'] = df['ror avg'] ** 0.2
df['ror avg'] = (df['ror avg'] - 1) * 100

df['fees savg'] = df['fees avg']
df['fees avg'] = ((((-1 * df['2011 Fees']) / 100) + 1) *
                  (((-1 * df['2012 Fees']) / 100) + 1) *
                  (((-1 * df['2013 Fees']) / 100) + 1) *
                  (((-1 * df['2014 Fees']) / 100) + 1) *
                  (((-1 * df['2015 Fees']) / 100) + 1))
df['fees avg'] = df['fees avg'] ** 0.2
df['fees avg'] = (df['fees avg'] - 1) * 100

Create columns for investment and operating expense averages.

df['ie avg'] = df[['2011 Investment expense ratio', '2012 Investment expense ratio',
                   '2013 Investment expense ratio', '2014 Investment expense ratio',
                   '2015 Investment expense ratio']].mean(axis=1)
df['oe avg'] = df[['2011 operating expense ratio', '2012 operating expense ratio',
                   '2013 operating expense ratio', '2014 Operating expense ratio',
                   '2015 Operating expense ratio']].mean(axis=1)
df['ie avg'] = df['ie avg'] * 100
df['oe avg'] = df['oe avg'] * 100

Create column for gross RoR.

df['ror gavg'] = df['ror avg'] + df['fees avg']

Obtain regression coefficients and calculate R2 for our first scatter plot.

pf_slope, pf_intercept, pf_r_value, pf_p_value, pf_std_err = stats.linregress(df['ror avg'], df['fees avg'])
pf_R2_value = pf_r_value ** 2

Calculate the 95% confidence interval.

pf_slope - 1.96 * pf_std_err
pf_slope + 1.96 * pf_std_err

We can now view and export data for our first scatterplot which looks at the relationship between fees and RoR, grouped by different fund types. We also output the slope and intercept for our best fit line as we will need to add this separately in D3.js.

fig, ax = plt.subplots()
plt.scatter(x=df[df['Fund type 1'] == 'Retail']['ror avg'], y=df[df['Fund type 1'] == 'Retail']['fees avg'], color='#1b9e77', label='Retail')
plt.scatter(x=df[df['Fund type 1'] == 'Retail - ERF']['ror avg'], y=df[df['Fund type 1'] == 'Retail - ERF']['fees avg'], color='#d95f02', label='Retail (ERF)')
plt.scatter(x=df[df['Fund type 1'] == 'Industry']['ror avg'], y=df[df['Fund type 1'] == 'Industry']['fees avg'], color='#7570b3', label='Industry')
plt.scatter(x=df[df['Fund type 1'] == 'Corporate']['ror avg'], y=df[df['Fund type 1'] == 'Corporate']['fees avg'], color='#e7298a', label='Corporate')
plt.scatter(x=df[df['Fund type 1'] == 'Public Sector']['ror avg'], y=df[df['Fund type 1'] == 'Public Sector']['fees avg'], color='#66a61e', label='Public Sector')
plt.plot(df['ror avg'], pf_slope * df['ror avg'] + pf_intercept, linewidth=2, c='#08519c')
plt.xlabel('% Average annual return')
plt.ylabel('% Average fees')
plt.legend(loc='upper right')

print(pf_slope)
print(pf_intercept)

header = ['ror avg', 'fees avg', 'Fund type 1']
df.to_csv('fig4.tsv', sep='\t', index=False, columns=header)

Obtain regression coefficients and calculate R2 for our second scatterplot, which looks at the relationship between operating expenses and RoR, grouped by fund type.

e1_slope, e1_intercept, e1_r_value, e1_p_value, e1_std_err = stats.linregress(df['ror avg'], df['oe avg'])
e1_R2_value = e1_r_value ** 2

Calculate the 95% confidence interval.

e1_slope - 1.96 * e1_std_err
e1_slope + 1.96 * e1_std_err

View and export data for our second scatterplot.

fig, ax = plt.subplots()
plt.scatter(x=df[df['Fund type 1'] == 'Retail']['ror avg'], y=df[df['Fund type 1'] == 'Retail']['oe avg'], color='#1b9e77', label='Retail')
plt.scatter(x=df[df['Fund type 1'] == 'Retail - ERF']['ror avg'], y=df[df['Fund type 1'] == 'Retail - ERF']['oe avg'], color='#d95f02', label='Retail (ERF)')
plt.scatter(x=df[df['Fund type 1'] == 'Industry']['ror avg'], y=df[df['Fund type 1'] == 'Industry']['oe avg'], color='#7570b3', label='Industry')
plt.scatter(x=df[df['Fund type 1'] == 'Corporate']['ror avg'], y=df[df['Fund type 1'] == 'Corporate']['oe avg'], color='#e7298a', label='Corporate')
plt.scatter(x=df[df['Fund type 1'] == 'Public Sector']['ror avg'], y=df[df['Fund type 1'] == 'Public Sector']['oe avg'], color='#66a61e', label='Public Sector')
plt.plot(df['ror avg'], e1_slope * df['ror avg'] + e1_intercept, linewidth=2, c='#08519c')
plt.xlabel('% Average annual return')
plt.ylabel('% Operating expense ratio')
plt.legend(loc='upper right')

print(e1_slope)
print(e1_intercept)

header = ['ror avg', 'oe avg', 'Fund type 1']
df.to_csv('fig5.tsv', sep='\t', index=False, columns=header)

Obtain regression coefficients and calculate R2 for our third scatterplot, which looks at the relationship between fees and gross RoR, grouped by fund type.

gpf_slope, gpf_intercept, gpf_r_value, gpf_p_value, gpf_std_err = stats.linregress(df['ror gavg'], df['fees avg'])
gpf_R2_value = gpf_r_value ** 2

View and export data for our third scatterplot.

fig, ax = plt.subplots()
plt.scatter(x=df[df['Fund type 1'] == 'Retail']['ror gavg'], y=df[df['Fund type 1'] == 'Retail']['fees avg'], color='#1b9e77', label='Retail')
plt.scatter(x=df[df['Fund type 1'] == 'Retail - ERF']['ror gavg'], y=df[df['Fund type 1'] == 'Retail - ERF']['fees avg'], color='#d95f02', label='Retail (ERF)')
plt.scatter(x=df[df['Fund type 1'] == 'Industry']['ror gavg'], y=df[df['Fund type 1'] == 'Industry']['fees avg'], color='#7570b3', label='Industry')
plt.scatter(x=df[df['Fund type 1'] == 'Corporate']['ror gavg'], y=df[df['Fund type 1'] == 'Corporate']['fees avg'], color='#e7298a', label='Corporate')
plt.scatter(x=df[df['Fund type 1'] == 'Public Sector']['ror gavg'], y=df[df['Fund type 1'] == 'Public Sector']['fees avg'], color='#66a61e', label='Public Sector')
plt.plot(df['ror gavg'], gpf_slope * df['ror gavg'] + gpf_intercept, linewidth=2, c='#08519c')
plt.xlabel('% Average annual return')
plt.ylabel('% Average fees')
plt.legend(loc='upper right')

print(gpf_slope)
print(gpf_intercept)

header = ['ror gavg', 'fees avg', 'Fund type 1']
df.to_csv('fig6.tsv', sep='\t', index=False, columns=header)

Finally, calculate the average proportion of returns that are eroded by fees for each fund type.

df['fees prop'] = df['fees avg'] / df['ror gavg']
df[df['Fund type 1'] == 'Retail - ERF']['fees prop'].mean()
df[df['Fund type 1'] == 'Retail']['fees prop'].mean()
df[df['Fund type 1'] == 'Industry']['fees prop'].mean()
df[df['Fund type 1'] == 'Corporate']['fees prop'].mean()
df[df['Fund type 1'] == 'Public Sector']['fees prop'].mean()

We can now construct the CDFs and scatterplots from our exported data. Only the first of each kind is shown here, as the remaining charts only require minor modifications for the other data series.

First set the CSS file (chart.css) with altered styles for the axes, paths, dots, lines and legends.

.axis path,
.axis line {
  fill: none;
  stroke: #bdbdbd;
  shape-rendering: crispEdges;
}
#axisy, #axisx {
  font-size: 17px;
  fill: #333;
}
#dot {
  stroke: none;
}
#legend {
  font-size: 17px;
  fill: #333;
}
.x.axis path, .x.axis line {
  stroke: #bdbdbd;
}
.y.axis path {
  stroke: #fff;
}
.line {
  fill: none;
}
#axisy1, #axisx1 {
  font-size: 17px;
  fill: #333;
}

Moving on to our JS file, beginning with the first CDF: set the SVG margins, axes and lines. Remember to add the script tags and the path to D3.

var margin = {top: 20, right: 10, bottom: 30, left: 35},
    width = 670,
    height = 400;

var x = d3.scale.linear()
    .range([0, width]);

var y = d3.scale.linear()
    .range([height, 0]);

var color = d3.scale.category20();

var xAxis = d3.svg.axis()
    .scale(x)
    .orient("bottom");

var yAxis = d3.svg.axis()
    .scale(y)
    .orient("left");

var line1 = d3.svg.line()
    .interpolate("basis")
    .x(function(d) { return x(d.bin1); })
    .y(function(d) { return y(d.cdf1); });

var line2 = d3.svg.line()
    .interpolate("basis")
    .x(function(d) { return x(d.bin2); })
    .y(function(d) { return y(d.cdf2); });

var line3 = d3.svg.line()
    .interpolate("basis")
    .x(function(d) { return x(d.bin3); })
    .y(function(d) { return y(d.cdf3); });

var svg = d3.select("#fig1").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

Import data from the TSV file.

d3.tsv("fig1.tsv", function(error, data) {
  if (error) throw error;
  data.forEach(function(d) {
    d.cdf1 = +d.cdf1;
    d.bin1 = +d.bin1;
    d.cdf2 = +d.cdf2;
    d.bin2 = +d.bin2;
    d.cdf3 = +d.cdf3;
    d.bin3 = +d.bin3;
  });

Set x and y domains and append both axes.

x.domain([0, 11]).nice();
y.domain([0.0, 1.0]).nice();

svg.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0," + height + ")")
    .attr("id", "axisx1")
    .call(xAxis)
  .append("text")
    .attr("x", width)
    .attr("y", -6)
    .style("text-anchor", "end")
    .text("ror avg");

svg.append("g")
    .attr("class", "y axis")
    .attr("id", "axisy1")
    .call(yAxis)
  .append("text")
    .attr("transform", "rotate(-90)")
    .attr("y", 6)
    .attr("dy", ".71em")
    .style("text-anchor", "end")
    .text("Fraction of super funds");

Set colour style for legend.

svg.append("g")
    .style("legend", function(d) { return color(); });

Append each CDF.

svg.append("path")
    .datum(data)
    .attr("class", "line")
    .attr("d", line1)
    .style("stroke-width", "1.5px")
    .style("stroke", "#1F77B4");

svg.append("path")
    .datum(data)
    .attr("class", "line")
    .attr("d", line2)
    .style("stroke-width", "1.5px")
    .style("stroke", "#FF7F0E");

svg.append("path")
    .datum(data)
    .attr("class", "line")
    .attr("d", line3)
    .style("stroke-width", "1.5px")
    .style("stroke", "#2CA02C");

Append the legend. Note that this is not the optimal method, but it's more straightforward given the way the data was loaded into the DOM.

var legend = svg.selectAll(".legend")
    .data(color.domain())
  .enter().append("g")
    .attr("class", "legend");

legend.append("rect")
    .attr("x", 33)
    .attr("width", 18)
    .attr("height", 18)
    .style("fill", "#1F77B4");

legend.append("text")
    .attr("x", 56)
    .attr("y", 10)
    .attr("dy", ".35em")
    .attr("id", "legend")
    .style("text-anchor", "start")
    .text(function(d) { return "Retail"; });

legend.append("rect")
    .attr("x", 33)
    .attr("y", 20)
    .attr("width", 18)
    .attr("height", 18)
    .style("fill", "#FF7F0E");

legend.append("text")
    .attr("x", 56)
    .attr("y", 30)
    .attr("dy", ".35em")
    .attr("id", "legend")
    .style("text-anchor", "start")
    .text(function(d) { return "Industry"; });

legend.append("rect")
    .attr("x", 33)
    .attr("y", 40)
    .attr("width", 18)
    .attr("height", 18)
    .style("fill", "#2CA02C");

legend.append("text")
    .attr("x", 56)
    .attr("y", 50)
    .attr("dy", ".35em")
    .attr("id", "legend")
    .style("text-anchor", "start")
    .text(function(d) { return "Corporate"; });
});

Next we construct the first scatterplot, which again will be used as a prototype for the charts not included here.

Set the SVG margins, axes and legend colours. Remember to add the script tags and the path to D3.

var margin = {top: 20, right: 10, bottom: 30, left: 35},
    width = 670,
    height = 400;

var x = d3.scale.linear()
    .range([0, width]);

var y = d3.scale.linear()
    .range([height, 0]);

var color = d3.scale.category20()
    .domain(['Retail', 'Retail - ERF', 'Industry', 'Public Sector', 'Corporate']);

var xAxis = d3.svg.axis()
    .scale(x)
    .orient("bottom");

var yAxis = d3.svg.axis()
    .scale(y)
    .orient("left");

var svg = d3.select("#fig5").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

Import data from the TSV file.

d3.tsv("fig5.tsv", function(error, data) {
  if (error) throw error;
  data.forEach(function(d) {
    d.feesavg = +d.feesavg;
    d.roravg = +d.roravg;
  });

Set the x and y domains and append both axes.

x.domain(d3.extent(data, function(d) { return d.roravg; })).nice();
y.domain(d3.extent(data, function(d) { return d.feesavg - 0.5; })).nice();

svg.append("g")
    .attr("class", "x axis")
    .attr("transform", "translate(0," + height + ")")
    .attr("id", "axisx")
    .call(xAxis)
  .append("text")
    .attr("class", "label")
    .attr("x", width)
    .attr("y", -6)
    .style("text-anchor", "end")
    .text("ror avg");

svg.append("g")
    .attr("class", "y axis")
    .attr("id", "axisy")
    .call(yAxis)
  .append("text")
    .attr("class", "label")
    .attr("transform", "rotate(-90)")
    .attr("y", 6)
    .attr("dy", ".71em")
    .style("text-anchor", "end")
    .text("oe avg");

Append the dots.

svg.selectAll(".dot")
    .data(data)
  .enter().append("circle")
    .attr("class", "dot")
    .attr("r", 3.5)
    .attr("cx", function(d) { return x(d.roravg); })
    .attr("cy", function(d) { return y(d.feesavg); })
    .attr("id", "dot")
    .style("fill", function(d) { return color(d.fundtype1); });

Set the best-fit line manually using y = mx + b; the slope and y-intercept were obtained when the regression was run.

svg.append("line")
    .attr("x1", x(-2))
    .attr("x2", x(11))
    .attr("y1", y(-0.2768 * -2 + 2.95968))
    .attr("y2", y(-0.2768 * 11 + 2.95968))
    .style("stroke", "#969696")
    .style("stroke-width", "1.5px");

Finally, append the legend.

var legend = svg.selectAll(".legend")
    .data(color.domain())
  .enter().append("g")
    .attr("class", "legend")
    .attr("transform", function(d, i) { return "translate(0," + i * 20 + ")"; });

legend.append("rect")
    .attr("x", width - 18)
    .attr("width", 18)
    .attr("height", 18)
    .style("fill", color);

legend.append("text")
    .attr("x", width - 24)
    .attr("y", 9)
    .attr("dy", ".35em")
    .attr("id", "legend")
    .style("text-anchor", "end")
    .text(function(d) { return d; });
});

Optional: You can make all the charts here responsive, so that they resize correctly on mobile devices.

First, add the following CSS. The bottom padding refers to the aspect ratio we need to maintain in relation to the width.

.svg-container {
  display: inline-block;
  position: relative;
  width: 100%;
  padding-bottom: 63.1944%;
  vertical-align: top;
  overflow: hidden;
}
.svg-content-responsive {
  display: inline-block;
  position: absolute;
  top: 0px;
  left: 0px;
  right: 5px;
}

Then redo the SVG variable for each figure. In this case, we place the SVG object in a resizable container rather than using a fixed width and height. For example, the SVG object for our first CDF can be modified as follows.

var svg = d3.select("#fig1")
    .append("div")
    .classed("svg-container", true)
    .append("svg")
    .attr("preserveAspectRatio", "xMinYMin meet")
    .attr("viewBox", "0 0 " + (width + margin.left + margin.right) + " " + (height + margin.top + margin.bottom))
    .classed("svg-content-responsive", true)
    .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

And that's it.

About Martin

I'm a developer and data analyst based in Sydney, currently working part time on a multi-platform fintech product built on Ionic, Angular and Node. I have a bachelor's degree in finance from Macquarie University and postgraduate studies in IT from UNSW. Follow me on Twitter.