I was trying to recover a good approximation to $w^*$ from a linear system $Xw^* = y$, where $X$ is $N \times D$. However, $w^*$ is generated by some other function $f$ and satisfies:

$$f(w) = w^*$$

so in reality I am solving for $f(w)$ in:

$$ Xf(w) = y $$

Note that I don't care about $w$ at all, just $w^*$.

The exact true $f$ is usually unknown, but I usually have strong priors on what $f$ is. If the linear system is underdetermined ($N < D$), then we need some type of bias/regularizer for the system to pick out the right solution. For example, a well-studied case is when $f$ selects a sparse subset of the elements of $w$; there, orthogonal matching pursuit or LASSO ($\ell_1$ regularization) are the algorithms/priors of choice. However, if I were to choose $f(w)$ to have a Gaussian shape, say, what regularizer should I choose for Tikhonov regularization to pick up the right shape in my estimate of $w^*$?
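To make the Gaussian-shape case concrete, here is a minimal sketch of one natural option: encode the expected shape as per-coordinate Tikhonov weights that are cheap where the prior expects large entries and expensive where it expects small ones. The envelope `g`, the weight formula, and all names below are illustrative assumptions, not something from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 50, 1e-3

# Underdetermined system: X is N x D with N < D.
X = rng.standard_normal((N, D))

# Suppose the prior says w* follows a Gaussian-shaped envelope g(d):
# large expected magnitude near the center of the index range, small in the tails.
d = np.arange(D)
g = np.exp(-0.5 * ((d - D / 2) / 5.0) ** 2)

# Weighted Tikhonov: penalize coordinate d by alpha_d * w_d^2 with
# alpha_d = 1 / (g_d^2 + eps), so deviations are cheap where the prior
# expects large entries and expensive where it expects near-zero entries.
alpha = 1.0 / (g ** 2 + 1e-6)

w_star = g * rng.standard_normal(D)   # a draw consistent with the assumed prior
y = X @ w_star

# Closed-form minimizer of ||Xw - y||^2 + lam * sum_d alpha_d * w_d^2.
A = X.T @ X + lam * np.diag(alpha)
w_hat = np.linalg.solve(A, X.T @ y)
```

Here the weights are fixed from the prior rather than optimized, which sidesteps the issue discussed further down.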

For example, if I know $f(w)$ exactly, is it possible to design a regularizer that picks out the right $w^*$, or a good approximation to it?
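One standard way to formalize this (offered as a hedged pointer, not something stated in the question): if $w^*$ is modeled as a Gaussian draw, $w^* \sim \mathcal{N}(\mu, \Sigma)$, then the MAP estimate under Gaussian observation noise is a generalized Tikhonov problem,

$$ \hat{w}^* = \arg\min_{v} \frac{1}{N}\| Xv - y \|^2 + \lambda\, (v - \mu)^\top \Sigma^{-1} (v - \mu), $$

so the regularizer is determined directly by the prior, and "moving the Gaussian around" only changes $\mu$ and $\Sigma$, not the form of the penalty.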

The only algorithm that I thought could be good is the following:

$$ \min_{w,\alpha}\frac{1}{N}\| Xw - y \|^2 + \lambda \sum^D_{d=1} \alpha_{d} | w_d |^2$$

and then minimize with respect to both $w$ and $\alpha$. Has this problem been studied before?

Note that $\lambda$ is a hyperparameter.

My suspicion is that something is wrong with the optimization problem I suggested. In principle it looks like it should be able to recover sparsity, since it just needs to set $\alpha_d$ very large for the irrelevant coefficients, forcing those $w_d$ toward zero. What I realize now is that it is hard to optimize the $\alpha_d$: the problem arose from having an under-constrained system, and introducing MORE free parameters in the regularizer brings more problems, not fewer. A regularizer is supposed to bias the solution, but since each $\alpha_d$ appears only in a nonnegative penalty term, jointly minimizing over $\alpha$ simply sets every $\alpha_d = 0$, and the regularizer vanishes instead of biasing anything. Is the only way out to hard-code $\alpha_d = f(d)$ or something like that? That seems rather unsatisfying... too strong a prior; why even learn then? And what if $f(\cdot)$ is moved around (say, the Gaussian is shifted)?
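The collapse is easy to check numerically. The following sketch (with made-up data; all names are illustrative) evaluates the proposed objective and confirms that, for any fixed $w$, setting $\alpha = 0$ never increases it, so the joint minimum over $\alpha \ge 0$ always switches the regularizer off:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 5, 12, 0.1
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
w = rng.standard_normal(D)

def objective(w, alpha):
    """(1/N) ||Xw - y||^2 + lam * sum_d alpha_d * w_d^2."""
    return np.sum((X @ w - y) ** 2) / N + lam * np.sum(alpha * w ** 2)

# Each alpha_d multiplies the nonnegative quantity lam * w_d^2, so the
# objective is non-decreasing in every alpha_d over alpha_d >= 0.
alpha_pos = rng.uniform(0.1, 2.0, size=D)
val_zero = objective(w, np.zeros(D))   # regularizer switched off
val_pos = objective(w, alpha_pos)      # any positive weights
```

So the inner minimization over $\alpha$ always returns the unregularized least-squares problem, which is exactly the under-constrained problem we started with.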

Maybe $f$ is not known exactly but the general shape is known.

Are there any feature selection algorithms that might be a good idea to try?

Note that if the system has full column rank ($N \ge D$), I assume one always recovers the exact solution without needing to include priors, so I assume the interesting case to ask about is $N < D$.
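A quick sanity check of that split (an illustrative sketch with random data, not part of the question): with $N \ge D$ and full column rank, ordinary least squares recovers $w^*$ exactly from consistent data; with $N < D$, the minimum-norm least-squares solution still fits $y$ exactly but generically differs from $w^*$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Overdetermined, full column rank: exact recovery, no prior needed.
X_full = rng.standard_normal((20, 5))
w_star = rng.standard_normal(5)
w_hat, *_ = np.linalg.lstsq(X_full, X_full @ w_star, rcond=None)

# Underdetermined (N < D): lstsq returns the minimum-norm solution,
# which fits y exactly but generically is not w_star.
X_under = rng.standard_normal((3, 10))
w_star_u = rng.standard_normal(10)
y_u = X_under @ w_star_u
w_hat_u, *_ = np.linalg.lstsq(X_under, y_u, rcond=None)
```

This is why some bias toward the right solution family is unavoidable in the $N < D$ case.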

It is safe to assume that $X$ is fixed.