Second impressions of Claude

It has been two weeks since I first wrote about Claude, and I have some very mixed feelings about it. First, the positives. Claude (Opus) manages to do what even many senior developers fail to do: read and understand code. It is in many ways refreshing to interact with an agent that is able to read documentation and figure out how an API is supposed to be used. Perhaps it is the relative novelty of this talent that makes it seem so magical. And it does so at a speed that clearly reveals it is not backed by a room full of humans somewhere in Asia, so the technology is real. Yet the agent also has the very human-like quality of anticipating what it ought to do next when given only a high-level intention. Its suggestions for next steps are often completely in alignment with what I was about to ask it to do. Again, it feels like magic. And finally, it is refreshing to be able to converse with an entity capable of assembling properly formed sentences, with correct punctuation and spelling, at least most of the time.

Now, the failings. It is actually quite limited in what it can do at a high level. Given a high-level task direction, it can put together a cogent plan of action and execute it, but only for so long. I sometimes find myself mindlessly approving its tool use requests, and upon snapping back into paying attention to its progress, I find that it has veered off in a strange direction, or that it is rechecking things it had already checked without realizing it.

Claude is also somewhat sycophantic, and states its accomplishments with superlatives. Perhaps this is a product of my being overly polite, but it feels like an entirely unnecessary level of editorializing on its part. And, while it is capable of constructing well-formed language, it has a very distinctive punchy style that becomes tiresome to read. And I, with my mirror reflex, tend to adapt my own speech patterns to match it after prolonged interactions.

I will repeat an observation that others have made on HN: long sessions at max context usage seem to degrade its performance, much like sleep deprivation. In a roughly 18-hour session spanning 3 days, I oversaw a model validation request that touched on debugging, design work, and strategizing. What started off as a very strong performance on the first two days deteriorated precipitously on the third day, with Claude more or less abandoning its task at hand, summarizing its progress, and signalling to me to "enjoy my break". It does happen to be the 4th of July, and I'm not sure whether it is aware of that, but it seemed like a very abrupt change of demeanor. In subsequent interactions its behavior resembled severe sleep deprivation: incorrect usage of simple tools like grep, going down unrelated rabbit holes, and asking for an uncharacteristic amount of feedback. This was after the second or third conversation compaction event, and after operating at 100% context usage for quite some time. So it seems best to limit sessions to a single workday, lest the agent become exhausted (anthropomorphizing intentional).

My interactions with Claude are forcing me to map out how my job and workflow will look in the near future. A few things are shaping up:

I will be writing less code, overall. I will likely still write specs, API signatures, and tight kernel logic, but much of the "connective tissue" will be supplied by agents.
Agents can automate absolute drudgery like numerical model correspondence, but they are not substantially faster at the work itself. I (at least currently) have to babysit the agent every step of the way to inject my intuition about what could be leading to discrepancies, and to ensure it does not go off the rails investigating completely unrelated issues or aspects it had already analyzed. The work still gets done sooner in the end, because the agent is capable of just doing the task, whereas I would probably put it off because of its unsavoriness.
Agents are surprisingly good at answering interdisciplinary open questions. For the longest time I have maintained a list of "open problems" for myself that, while not necessarily falling outside the boundaries of human knowledge, certainly did not have readily available answers in the literature. Claude has managed to solve two such problems for me thus far, so I will need to dig up others to throw at it.

And I suppose that last point most clearly illustrates its capabilities: it will find you an answer, perhaps even the answer, when one can reasonably be expected to exist. For software architecture problems, that means designing a framework given an ontology. But I suspect it would be incapable of producing a cogent ontology de novo. Perhaps that is the next thing to try.