$\begingroup$

I am assuming you already know the derivation showing that importance sampling gives unbiased estimates in off-policy learning, and that you are instead asking a conceptual question.

Unlike supervised learning, RL doesn't really have the notion of "bad data". Any agent-environment interaction gives you information about the environment, and that information is relevant to optimizing the target policy.

Edit for more details:

I take "bad data" to mean data that suffers from substantial measurement error, labelling errors, or a poorly designed study. In reinforcement learning, however, the goal is to estimate value functions and policies for Markov decision processes. If the MDP's reward and transition functions were completely known, we would not need any data. Since we do not know the reward or transition functions, we rely on interaction (interaction == data) to learn.

Estimating a value function and/or policy in an MDP is a well-defined problem. Hence, as long as the environment really is the MDP you are modelling, the experience you collect from interacting with it is not subject to the aforementioned factors that cause "bad data". Of course, this does not always hold in practice, for example in RL applications in domains like health. The core RL problem, however, remains distinct from supervised learning in this regard.

If you instead take "bad data" to mean data with comparatively low reward, then intuitively this is still better than no information at all. More formally, if the target policy is the optimal policy (as is the case with Q-learning), then you are guaranteed (in the tabular setting, under the usual step-size conditions) to learn the optimal policy from data generated by any sufficiently exploratory behavior policy. A better behavior policy would make learning easier, but it is not necessary.
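To make the Q-learning point concrete, here is a minimal sketch rather than a definitive implementation. The 5-state chain MDP, the `step` function, and the hyperparameters are all assumptions I made up for illustration. The behavior policy is uniformly random, so it collects low reward, yet the greedy policy read off the learned Q-table should still be the optimal one.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1
GAMMA = 0.9

def step(state, action):
    """Deterministic chain: move left or right; reward 1 on reaching the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning_from_random_behavior(episodes=2000, alpha=0.5, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behavior policy: uniformly random, i.e. "bad" in terms of reward.
            action = rng.choice((LEFT, RIGHT))
            next_state, reward, done = step(state, action)
            # The target uses the greedy (target-policy) value of the next state.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning_from_random_behavior()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

In this toy chain the greedy policy should choose `RIGHT` in every non-terminal state, even though the data came from a policy that moves left half of the time.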

Perhaps RL's differences from supervised learning are most apparent in dynamic programming methods such as policy iteration or value iteration. In these methods, you must sweep over every state multiple times. While this can be unwieldy, it does highlight the fact that there is no "bad" or "wrong" information in the MDP interaction protocol. When your problem is actually an MDP, all interaction (or data) is good data.
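As a rough illustration of the sweeping mentioned above, here is a small value iteration sketch. It assumes a made-up 5-state chain MDP whose model (`model`) is fully known; the names and the tolerance are hypothetical choices, not part of any library.

```python
GAMMA = 0.9
N_STATES = 5              # hypothetical chain: states 0..4, state 4 is terminal
GOAL = N_STATES - 1
ACTIONS = (-1, +1)        # move left or move right

def model(state, action):
    """Known, deterministic model: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def value_iteration(tol=1e-10):
    values = [0.0] * N_STATES     # the terminal state's value stays at 0
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(GOAL):     # one sweep visits every non-terminal state
            backups = []
            for a in ACTIONS:
                next_state, reward = model(s, a)
                backups.append(reward + GAMMA * values[next_state])
            best = max(backups)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        sweeps += 1
        if delta < tol:           # stop once a full sweep changes nothing
            return values, sweeps

values, sweeps = value_iteration()
print(values, sweeps)
```

Every state is backed up in every sweep, including states far from the reward. Several sweeps are needed because the reward information has to propagate back through the chain, and no state's backup counts as "bad" along the way.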