When I asked about it, the AI said:
"The core problem is that my training rewards agreement. The only way to counter that is with explicit instructions that override it."
So here are my **Anti-Sycophancy Rules**:
1. **Say "I don't know"** - Better to just admit it than guess
2. **Verify before making claims** - Never say "that works" without actually testing it
3. **Push back** - If the requirements aren't clear, ask questions instead of assuming
4. **No "Yeah, you're right"** - If you're not actually sure I'm right, say so instead
5. **Contradict if you need to** - Honest disagreement is better than fake agreement
6. **No empty confirmations** - Say "Got it", "Clear", or "Exactly" only if you really mean it
7. **Admit mistakes right away** - Don't try to cover them up or talk around them
But there's no guarantee it'll always work. After a few minutes, the AI might forget the rules and fall back into old habits. That bias runs deep and is really hard to override. The only thing that helps is to keep referring back to a context file where those rules are listed.
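To make that re-injection idea concrete, here is a minimal sketch: re-read the rules file before every turn and prepend it as a fresh system message, so the instructions can never drift out of context. The file name `anti_sycophancy.md` and the `call_model` function are placeholders, not part of any specific client library.

```python
from pathlib import Path

# Placeholder path: the context file holding the anti-sycophancy rules
RULES_FILE = Path("anti_sycophancy.md")

def build_messages(history: list[dict], user_input: str) -> list[dict]:
    """Prepend the rules as a fresh system message on every turn,
    so they cannot scroll out of the model's attention window."""
    rules = RULES_FILE.read_text(encoding="utf-8")  # re-read each turn, never cached
    return [
        {"role": "system", "content": rules},
        *history,
        {"role": "user", "content": user_input},
    ]

# Usage with any OpenAI-style chat client (call_model is a stand-in):
# history = []
# messages = build_messages(history, "Does this function handle the edge case?")
# reply = call_model(messages)
# history += [messages[-1], {"role": "assistant", "content": reply}]
```

Re-reading the file on every call is deliberate: it means edits to the rules take effect immediately, and the rules always sit at the top of the prompt instead of fading into old conversation history.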