Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Item #: 079017-0466

