Theory - Q-learning: exploitation & exploration and discounted reward