Let’s take our beloved palm oil spread, Nutella, as an example.

Here are the ingredients:

Weights

In my linear regression model, the parameters (weights) are the amounts in grams for the different ingredients:

w1 is the amount of sugar in 100g of Nutella,

w2 is the amount of palm oil in 100g of Nutella,

and so on…

In some cases, the percentages are known. This is the case for European Nutella, where the label gives the amounts of the healthier ingredients (hazelnut, cocoa, etc.). Those weights are then fixed and not trainable.
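To make the fixed-versus-trainable split concrete, here is a minimal framework-free sketch. The 13% hazelnut and 7.4% cocoa figures come from the European label; the other gram values are illustrative placeholders, not the actual recipe, and the ingredient list is a simplified subset.

```python
# A weight vector where some entries are fixed (known percentages from the
# label) and the rest are trainable. Gram values are placeholders.
weights = {
    "sugar":     {"grams": 40.0, "trainable": True},
    "palm_oil":  {"grams": 25.0, "trainable": True},
    "hazelnut":  {"grams": 13.0, "trainable": False},  # printed on the jar
    "cocoa":     {"grams": 7.4,  "trainable": False},  # printed on the jar
    "skim_milk": {"grams": 8.0,  "trainable": True},
}

def gradient_step(weights, grads, lr=0.1):
    """Update only the trainable weights; the labeled ones stay put."""
    for name, w in weights.items():
        if w["trainable"]:
            w["grams"] -= lr * grads.get(name, 0.0)
    return weights
```

In PyTorch the same effect can be had by masking gradients or by simply not registering the known amounts as parameters.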

Input and Output

Now the nutrition facts label:

Each nutritional component becomes a training observation/example (x, y).

Let’s take the “Total Fat” component as an example: it gives us an (x, y) tuple.

x is a row vector containing the percentage of fat in each ingredient:

x1 is the percentage of fat in sugar (0%)

x2 is the percentage of fat in palm oil (100%)

…

These are straightforward. But for some ingredients, it is harder to guess the composition (lecithin, anyone?). For this experiment, I used the USDA National Nutrient Database, which contains this information for most basic ingredients.

Note: this database is not a silver bullet; nutritional compositions can vary a lot (different varieties of hazelnuts exist, you can roast them or not, cocoa can be raw or fat-reduced, …)
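Assembling one observation then looks something like this. The per-ingredient fat fractions below are approximate, in the spirit of the USDA database, and the ingredient list is the same illustrative subset as before, not the full recipe:

```python
# Per-ingredient fat content (grams of fat per gram of ingredient),
# approximate values looked up per ingredient; not exact.
fat_fraction = {
    "sugar": 0.0,
    "palm_oil": 1.0,
    "hazelnut": 0.61,   # hazelnuts are roughly 61g of fat per 100g
    "cocoa": 0.14,      # fat-reduced cocoa powder
    "skim_milk": 0.01,
}

ingredients = ["sugar", "palm_oil", "hazelnut", "cocoa", "skim_milk"]

# x: row vector of compositions, in the same order as the weights
x = [fat_fraction[name] for name in ingredients]

# y: grams of fat in 100g of the final product, read off the label
y = 31.0
```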

On the other side of my very deep 1-layer neural network, y is a scalar: the amount of fat in the final product. This information is easily found in the nutrition facts table:

12g of fat “per serving” for Nutella, or, in a more civilized nutrition labeling system: 31% (thank you France)

Since this labeling is quite verbose, we can get about ten (x, y) samples.

Training

Of course, in the model, the linear unit has no bias. There is no dark matter to find in food ingredients; everything is accounted for in the weighted sum of the quantities.
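In other words, the whole model is a single dot product: predicted grams of a nutrient = sum over ingredients of (grams of ingredient) × (nutrient per gram of ingredient). A minimal sketch, with the same placeholder numbers as above (not the real recipe):

```python
def predict(w, x):
    """Bias-free linear unit: weighted sum of ingredient masses
    (grams per 100g of product) times per-gram nutrient content."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Illustrative numbers only:
w = [40.0, 25.0, 13.0, 7.4, 8.0]   # grams of each ingredient per 100g
x = [0.0, 1.0, 0.61, 0.14, 0.01]   # grams of fat per gram of ingredient
predicted_fat = predict(w, x)      # to be compared against y = 31g
```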

Declaring all of this in PyTorch is relatively easy (this is my first time using it); the library is very natural and straightforward, and I think I understand the hype now. Now, should we just minimize the raw L2 loss? No: there are several constraints we have to impose on the model for it to converge to a plausible local minimum and not to some weird recipe where I have to put in a negative amount of cocoa.

Specific Domain Constraints

a mass cannot be negative (no kidding)

some weights are fixed (when a percentage is known)

the sum of the masses must be equal to 100g

and most importantly, weights are to be kept in descending order (the same way food ingredients are listed in descending order on the packaging)

Some of these constraints are enforced when updating the weights, and some through alchemist tricks in the loss function. None of this is pretty, I am afraid.
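One way to split the work, sketched framework-free (this is my reading of the idea, not the exact code): the hard constraints are applied as a projection after each gradient step, and the ordering constraint becomes a soft penalty added to the loss. The index-to-value map of fixed weights is a hypothetical helper for the example.

```python
def project(w, fixed):
    """Enforce the hard constraints after each gradient step:
    non-negative masses, fixed known percentages, total mass of 100g."""
    w = [max(0.0, wi) for wi in w]        # no negative amount of cocoa
    for i, val in fixed.items():          # known percentages stay put
        w[i] = val
    free = [i for i in range(len(w)) if i not in fixed]
    budget = 100.0 - sum(fixed.values())  # mass left for the free weights
    total = sum(w[i] for i in free)
    if total > 0:
        for i in free:
            w[i] *= budget / total        # rescale so the sum is 100g
    return w

def order_penalty(w):
    """Soft constraint for the loss: penalize any weight that exceeds
    the one listed before it (ingredients appear in descending order)."""
    return sum(max(0.0, w[i + 1] - w[i]) for i in range(len(w) - 1))
```

The total loss would then be the L2 data term plus a multiple of `order_penalty(w)`; in PyTorch, the projection runs inside a `torch.no_grad()` block after each optimizer step.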

I used the whole dataset (batch gradient descent) to compute the loss function at each step. Here are the results: