In many real-world problems, the state space of an MDP cannot be observed directly and must instead be inferred from high-dimensional sensory observations. To explicitly separate task-relevant from task-irrelevant factors, we propose to model the latent embedding space $\mathcal{S}$ with two components: a task-relevant component $\mathcal{S}^+$ and a task-irrelevant component $\mathcal{S}^-$. We assume that the reward is fully determined by the task-relevant component, $r\colon \mathcal{S}^+ \to \mathbb{R}$, and that the task-irrelevant component carries no information about the reward: $\mathrm{MI}(r_t; s^{-}_{t}) = 0$ at each time step $t$.
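The factorization above can be sketched architecturally: if the reward head reads only the task-relevant part $s^+$ of the latent, then $\mathrm{MI}(r_t; s^-_t) = 0$ holds by construction. The sketch below, which is illustrative and not the paper's implementation (all dimensions and the linear encoder are assumptions), shows this split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
OBS_DIM, S_PLUS_DIM, S_MINUS_DIM = 16, 4, 4

# A linear stand-in for the encoder mapping an observation
# to the full latent s = (s^+, s^-).
W_enc = rng.normal(size=(OBS_DIM, S_PLUS_DIM + S_MINUS_DIM))

# Reward head: r depends on the task-relevant component s^+ only,
# so the task-irrelevant component s^- cannot carry reward information.
w_r = rng.normal(size=S_PLUS_DIM)

def encode(obs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Embed an observation and split the latent into (s^+, s^-)."""
    s = obs @ W_enc
    return s[:S_PLUS_DIM], s[S_PLUS_DIM:]

def reward(obs: np.ndarray) -> float:
    """r: S^+ -> R; the reward ignores s^- entirely."""
    s_plus, _ = encode(obs)
    return float(w_r @ s_plus)
```

Because `reward` never touches `s^-`, perturbing the task-irrelevant component leaves the predicted reward unchanged, mirroring the zero-mutual-information assumption.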