Search Results

lesswrong.com
lesswrong.com > posts > 5PCvd2FYxhCuqBdQw > my-hammertime-final-exam-1

My Hammertime Final Exam — LessWrong

1+ hour, 50+ min ago (469+ words) Firstly, I finally made it :~D It's my second attempt; I first tried to finish Hammertime around a year ago. I had even forgotten I had a LessWrong profile since then, so here I am, writing my first post. I kinda got…

lesswrong.com
lesswrong.com > posts > ev2iXMfeecxZkw8HR > key-to-life-no-9-access

Key to Life No. 9: Access — LessWrong

2+ hour, 8+ min ago (954+ words) There is now an enormous amount of incredibly useful information in the world. But at the same time, there is also a problem of access to it. On the one hand, access to knowledge is now better than it has…

lesswrong.com
lesswrong.com > posts > amYmcwCuyuCEZcrRm > understanding-when-and-why-agents-scheme

Understanding when and why agents scheme — LessWrong

3+ hour, 29+ min ago (232+ words) We consider the behaviors studied here as potential precursors to the full threat model of scheming: current agents may sometimes behave in ways consistent with scheming, but do not (yet) have the coherent long-term goals and the general capability that would make their…

lesswrong.com
lesswrong.com > posts > dMshzzgqm3z3SrK8C > the-hot-mess-paper-conflates-three-distinct-failure-modes

The Hot Mess Paper Conflates Three Distinct Failure Modes — LessWrong

21+ hour, 3+ min ago (655+ words) Anthropic's recent "Hot Mess of AI" paper makes an important empirical observation: as models reason longer and take more actions, their errors become more incoherent rather than more systematically misaligned. They use a bias-variance decomposition to show this, and conclude…

lesswrong.com
lesswrong.com > posts > DzKhPKjyYSy5oih2u > the-future-of-aligning-deep-learning-systems-will-probably

The Future of Aligning Deep Learning systems will probably look like "training on interp" — LessWrong

1+ day, 56+ min ago (1014+ words) Epistemic Status: I think this is right, but a lot of this is empirical, and it seems the field is moving fast. I should start by saying that this is dangerous territory. And there are obvious ways to botch this. …

lesswrong.com
lesswrong.com > posts > uix7mr2DyjeJ5pmaL > an-agent-autonomously-builds-a-1-5-ghz-linux-capable-risc-v

An agent autonomously builds a 1.5 GHz Linux-capable RISC-V CPU — LessWrong

1+ day, 58+ min ago (632+ words) A project from Verkor, a chip design startup. "Verkor is working with multiple of the top 10 fabless companies to deploy DC (Design Conductor; their AI agent for chip design) to accelerate their time to market". I wonder how impressive this…

lesswrong.com
lesswrong.com > posts > WFkPrPy2r27rknLtw > untrusted-monitoring-extra-bits

Untrusted monitoring: extra bits — LessWrong

1+ day, 2+ hour ago (954+ words) The following are some further notes related to untrusted monitoring I had while working on our untrusted monitoring paper. The sections are mostly independent of each other. In some of our experiments we looked at the situation where the trusted…

lesswrong.com
lesswrong.com > posts > QrPxNDprJwXiE73Py > finding-features-in-transformers-contrastive-directions

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines — LessWrong

1+ day, 2+ hour ago (591+ words) Note: This is a research update sharing preliminary results as part of ongoing work. Across all combinations, we observe that perturbing along the difference-of-means direction breaks through the suppression region at much smaller magnitudes than perturbing along random directions or…

lesswrong.com
lesswrong.com > posts > ixyokbwQEHgiHJYFW > confusion-around-the-term-reward-hacking

Confusion around the term reward hacking — LessWrong

1+ day, 7+ hour ago (599+ words) The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2] Importantly, this can't be diagnosed in isolation from a training process. It applies to a model that performs undesired behavior which achieves…

lesswrong.com
lesswrong.com > posts > LnJeXLY2Y2Au97dfL > arena-7-0-impact-report-1

ARENA 7.0 Impact Report — LessWrong

1+ day, 6+ hour ago (1218+ words) The impact report from ARENA's previous iteration, ARENA 6.0, is available here. ARENA 7.0 took place at the London Initiative for Safe AI (LISA) between January 5th and February 6th, 2026. The purpose of this report is to evaluate ARENA 7.0's impact according to ARENA's…
