LLMs now learn to think independently through Thought Preference Optimization (TPO), without human guidance or special training data
Thinking LLMs: General Instruction Following…