In this blog post I will explain a problem we encounter in observational cosmology called photometric redshifts, and show how we can use Mixture Density Networks (MDN's) to tackle it with an implementation in TensorFlow. MDN's are just a different flavour of neural network. In the paper (PDF) by Bishop, MDN's are applied to a toy problem: inferring the position of a robotic arm. In this blog post I want to show the use of MDN's on a real-world problem, and by real-world problem I mean a simulated galaxy data set. The code used in this work is based on the second of Itoro's comprehensive blog posts on MDN's; the first one uses Theano and can be found here.

If you are a machine learning researcher or enthusiast, I hope you can learn a bit about a challenging astronomy problem, and maybe you can help me answer some ML questions I have posed at the end of this blog. If you work with noisy data and/or you just want full probability density functions (PDF's), I hope you might be able to apply MDN's to your work. If you're not convinced why you might want a full PDF, have a look at this excellent post on utility functions by Rasmus Bååth on the need for PDF's for better decision making.

If you are completely new to TensorFlow, have a look at the following (free online) book: First contact with TensorFlow by Jordi Torres; I helped out with the translation from the original Spanish version. If you have no idea about neural nets and TensorFlow, this is a very gentle introduction.

This blog post is written in a Jupyter notebook, and the data used can be downloaded here: train data, test data.

The photometric redshift problem:¶

In short, we want to determine the distances to galaxies using just a few noisy features. The photometric redshift problem happens to be an inverse problem with heteroscedastic noise properties. Heteroscedastic means that the noise properties of the data are not constant; another way of saying this is that the variability of the measurements varies within the sample. The inverse part refers to the fact that there is more than one likely solution to the problem, hence the answer might be multimodal. By the way, we will refer to the distance of a galaxy as its redshift. This is because we live in an expanding universe: galaxies that are further away are moving away from us faster, and as the galaxies recede from each other, the frequency of their light viewed here on Earth is shifted towards the red side of the spectrum. So by measuring a galaxy's spectrum we can measure its redshift and thus its distance. The effect is similar to the Doppler shift. This information is not crucial for understanding the rest of this blog, but pretty cool, so do yourself a favour and read that wiki page.

Due to the nature of galaxies and the noise properties, it can happen that a galaxy that is close by (i.e. low redshift) looks like a different kind of galaxy that is far away (i.e. high redshift), and with the limited features we can measure we have no way of knowing which it is. So we are interested in estimating the probability density function of the redshift for each galaxy, and using MDN's is one possible way of doing so.

You might be wondering why we need the distances to the galaxies. In my case it is to measure the accelerated expansion of the universe, a.k.a. Dark Energy, and to be more exact I would like to infer the redshift distribution of an ensemble of galaxies. Other astronomers might want to use them for other purposes.

The redshift of a galaxy can be measured exactly with a spectrograph. A spectrograph detects emission lines in the spectrum of the galaxy that allow us to precisely determine the redshift. The problem with spectrographs is that they are expensive to use in terms of observing time, meaning you cannot observe massive amounts of galaxies with them. To measure the properties of massive amounts of galaxies (100 million+) we perform photometric surveys instead; these use large CCD cameras to take images of the sky through several filters. In the following image, the blue, green, and red lines are the spectrum of a galaxy at different redshifts. With a spectrograph one can measure the exact details of these lines and hence the redshift. The 5 grey areas at the bottom are the filter response curves used in photometric surveys. This means that in a photometric survey we measure 5 values, one for each of the filters. The measurements are referred to as magnitudes (which happen to be the negative of a log transform of the flux, so a high magnitude means a faint galaxy).
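To make the magnitude–flux relation concrete, here is a minimal sketch of the standard negative-log transform. The zero-point offset is instrument-dependent, so it is left as a hypothetical parameter here:

```python
import numpy as np

def flux_to_mag(flux, zero_point=0.0):
    """Magnitude as a negative log transform of flux.

    Because of the minus sign, fainter galaxies (lower flux)
    get *higher* magnitudes. `zero_point` is a placeholder for
    the survey-specific calibration offset.
    """
    return -2.5 * np.log10(flux) + zero_point

# A galaxy 100x brighter in flux is 5 magnitudes "lower":
flux_to_mag(100.0)  # -5.0 (with zero_point=0)
flux_to_mag(1.0)    #  0.0
```

The factor 2.5 is the conventional astronomical scaling; the key point for the rest of the post is simply that magnitudes are a log-compressed, inverted version of the measured flux.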

As you can see, the detailed information in the spectrum of a galaxy is lost when using photometric information; the advantage is that we can observe many more galaxies. But now the problem is that we have to infer the redshift of the galaxies from just those 5 numbers, and as you can probably see from the image, it's not going to be easy. This image is taken from the astroML page that accompanies the excellent book Statistics, Data Mining, and Machine Learning in Astronomy.

To make things more complicated, the noise levels within the data sets differ quite significantly. This can be because some parts of the sky have been observed with longer exposure times, but even for galaxies with the same exposure time the noise levels will differ, depending on the brightness and size of the galaxy and on the amount of atmospheric turbulence while observing. The good news is that for each of the 5 magnitudes (i.e. features) we can also estimate the noise of the measurement. This leaves us with 5 measured magnitudes, each accompanied by an error estimate, totalling 10 features (we are assuming the noise is uncorrelated, which is a simplification).
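The resulting 10-column feature matrix can be sketched as follows (the values below are random placeholders, not the actual survey data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_galaxies = 4

# 5 measured magnitudes per galaxy (placeholder range ~18-25)
mags = rng.uniform(18.0, 25.0, size=(n_galaxies, 5))
# 5 corresponding per-magnitude error estimates
mag_errs = rng.uniform(0.01, 0.5, size=(n_galaxies, 5))

# Stack magnitudes and their errors into a single (n, 10) feature matrix
X = np.hstack([mags, mag_errs])
```

Treating the error estimates as ordinary input features is what lets the network learn that a noisy measurement should yield a broader predicted PDF than a precise one.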

Just to recap, from a machine learning standpoint we have the following problem: we have a large data set for which we have measured 5 magnitudes and their respective errors, and a subset of these galaxies have also been observed with a spectrograph, so for those we know the exact redshift. We want to predict the redshift, so we have a regression problem, but we are not just interested in the most likely redshift but in the full PDF.

Mixture Density Networks¶

This is where MDN's come in. MDN's are very similar to standard neural networks, with the only difference being that the final layer is mapped to a mixture of distributions. This makes them an elegant solution for modelling arbitrary conditional probability distributions like the one we have here. In our case: $$ p(z \hspace{1mm} | \hspace{1mm} x) = \sum_{k=1}^{K} \pi_k(x) \, Beta(z \hspace{1mm} | \hspace{1mm} \alpha_k(x), \beta_{k}(x))$$ where $z$ is the redshift and $x$ are the measured features with their corresponding errors. So the output of an MDN is $K$ mixture weights and the parameters for each of the $K$ distributions. In our case we will be using a mixture of $Beta$ distributions as it suits the purpose of our problem. Here is a little recap on what shapes the $Beta$ distribution can take.
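To show what "the final layer is mapped to a mixture of distributions" means concretely, here is a NumPy sketch of that mapping and of the negative log-likelihood loss an MDN minimises. The actual implementation in this post uses TensorFlow; the exp-plus-one parameterisation for $\alpha_k, \beta_k$ is one possible choice, not necessarily the one used later, and this assumes the redshift has been rescaled to (0, 1) so it lies in the support of the Beta distribution:

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import beta as beta_dist

def mdn_params(raw, K):
    """Map raw final-layer activations of shape (n, 3K) to mixture parameters."""
    pis = softmax(raw[:, :K], axis=1)       # mixture weights pi_k(x), sum to 1
    alphas = np.exp(raw[:, K:2*K]) + 1.0    # positivity via exp; +1 is an
    betas = np.exp(raw[:, 2*K:]) + 1.0      # illustrative choice keeping a, b > 1
    return pis, alphas, betas

def mdn_nll(z, pis, alphas, betas):
    """Negative log-likelihood of redshifts z under the Beta mixture:
    p(z|x) = sum_k pi_k Beta(z; alpha_k, beta_k)."""
    dens = np.sum(pis * beta_dist.pdf(z[:, None], alphas, betas), axis=1)
    return -np.mean(np.log(dens + 1e-12))   # small epsilon for numerical safety
```

Training then consists of minimising `mdn_nll` with respect to the network weights producing `raw`; in TensorFlow the same two steps become ops inside the graph so gradients flow through them automatically.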