
Illegible benefits

A note from the author: I wrote this essay with Claude. The drafting was a collaboration, and so were several of the ideas, since AI tools have changed how I work in ways that bear directly on the argument.

Before 1846, the scope of what a surgeon could attempt was bounded by what a conscious patient could endure in minutes. Patients were held down by assistants or strapped to the operating table (my mother’s tonsillectomy in early 1960s Sicily was still performed more or less this way), and the surgeon’s reputation depended on the economy of their movements, because every second of the procedure was a second of conscious agony. Robert Liston could amputate a leg in under 30 seconds, because he had to. Operations were confined to the body’s surface, to amputations and the drainage of abscesses and the excision of superficial tumours, because no patient could withstand sustained work inside the thoracic, abdominal or cranial cavities. When ether and chloroform arrived, some of the most respected figures in American and British medicine argued that rendering patients unconscious was a grave clinical error. John Pollard Harrison of the Medical College of Ohio, then vice-president of the American Medical Association, wrote in 1849 that ‘pain is curative – the actions of life are maintained by it – were it not for the stimulation induced by pain, surgical operations would more frequently be followed by dissolution.’ Charles Meigs, who held the chair of obstetrics at Jefferson Medical College in Philadelphia, treated labour pains as a desirable, salutary and conservative manifestation of the life force. Other surgeons pointed out that conscious patients confirmed the surgical site, assisted in decisions during the operation, and provided real-time diagnostic feedback that would be lost under anaesthesia. 
Safety concerns intensified as reports accumulated, culminating in the Royal Medical and Chirurgical Society’s 1864 committee report, which catalogued 123 chloroform deaths. In the early years, the opposition to anaesthesia was widespread and grounded in the best available clinical evidence. What none of these critics could have told you was that anaesthesia would make possible open-heart surgery, organ transplantation, neurosurgery, and the entire architecture of modern surgical specialisation. The benefit was not a more comfortable version of what surgeons had been doing. It was the appearance of a possibility space whose contents were inconceivable from inside the surgical practice of 1846. Cases of this kind have a particular structure. An innovation arrives whose costs are perfectly legible, since they are visible against the baseline of the practice it is reorganising and can be measured with the practice’s existing instruments, while its most profound benefits depend on practices, institutions and concepts that do not yet exist at the moment of evaluation. Serious critics with domain expertise document the costs, often correctly, since costs of that kind are precisely what the evaluative vocabulary they have inherited is built to measure. There is no possible evaluation in the critical moment that could capture the benefits, because the vocabulary that would describe them is itself one of the things the innovation is going to bring into being. Take dating apps. Eli Finkel and others have shown that app-mediated dating can produce choice overload, weaken commitment formation, and reduce the development of relationship skills that older modes of meeting cultivated more naturally. The empirical literature on loneliness, declining marriage rates among the young, and the collapse of casual in-person romantic initiation is now substantial. 
The critics are documenting real things, and their evidence is solid in just the way Harrison’s evidence on anaesthesia was solid. What this evidence cannot capture is the set of effects that have emerged through the existence of the apps themselves and that nobody arguing about Tinder a decade ago could have specified. Public space itself has been quietly transformed, since the existence of an alternative channel for romantic initiation has weakened the normative status of an in-person approach to the point where women navigate streets, cafés and workplaces in ways that would have been unrecognisable in 1995. Some of what used to be ambient harassment in physical settings has been redistributed onto platforms where it is at least more controllable, since digital interactions can be filtered, blocked, reported and audited in ways that street and workplace harassment cannot. The more profound effects go beyond that. Using a formal matching-theory model and state-level data on broadband adoption as a proxy for online dating, the economists Josué Ortega and Philipp Hergovich in 2018 showed that online dating is consistent with the rapid increase in interracial marriages in the United States over the past two decades. Since people used to marry within their existing social graphs, and those graphs were racially segregated, any channel that connected strangers across graphs would be expected to produce social integration, even at modest adoption rates. 
Drawing on a nationally representative longitudinal dataset, the sociologist Michael Rosenfeld and colleagues showed that, from around 2013, the most common way for American heterosexual couples to meet was online, and, by 2017, about 65 per cent of same-sex couples met that way, while breakup rates and relationship satisfaction do not differ meaningfully between app-met and friend-met couples once you take into account how long the couple has been together. Other emergent effects, less easily quantified but well documented, are equally hard to specify in advance. A 2024 survey from the Trevor Project reports that roughly three-quarters of LGBTQ+ young people go online to connect with others because they find it difficult to do so in their daily lives, and a similar proportion say they can be their complete selves online, with both rates rising still further among transgender and nonbinary respondents. Meanwhile, only about half of those same young people find their school to be gender-affirming and only a little over a third say their home is. The peer-reviewed work consistently finds these online spaces especially important for rural young people without access to an offline community. The proliferation of fine-grained sexual and relational identities depends on infrastructure that lets people with rare combinations of traits find each other across continents rather than within local social graphs. Dating apps are a useful contemporary case because the relevant timescale is short enough that we can see new practices coming into being, but the structure they exhibit has played out repeatedly across the long history of innovations of this kind; looking at the pattern at greater historical depth makes its contours easier to see. Take the phonograph. The American composer John Philip Sousa, writing in Appleton’s Magazine in 1906 from the absolute centre of his musical authority, argued that the phonograph would destroy amateur musicianship. 
He was empirically right about the disappearance of the parlour piano over the following half-century. But the recording studio as a creative instrument, as well as sampling, multitrack composition and entire genres that exist only because they are compositionally inseparable from recorded sound, were all unsayable in 1906, since they depended on practices, technologies and aesthetic categories that the recording medium itself had to bring into being before anyone could describe what they would amount to. The pattern is also visible in the infrastructural cases, where the displacement is more diffuse. When gas engineers in the 1880s argued that early electric lighting was inferior to gas, they were correct about the comparison they were making. What the comparison could not capture was that the infrastructure being built to deliver electric light, the generators and transmission lines and distribution networks, would in time become the substrate on which a much wider electronic civilisation would be built, including telephony, radio, computation and broadcasting. The path from electric lighting to a transistor is long, runs through inventions and disciplines that did not exist in 1885, and would not have been predictable from inside the 1880s lighting debate. The argument is actually older than any of these cases. In the Phaedrus, Plato had Thamus warn that writing would corrode memory, and the warning was not wrong about what it was looking at, since the displacement of memorised oral traditions by written records is a real cultural transformation. 
What Thamus could not warn against, because the words for it did not yet exist, was the cumulative scholarship that depends on stable written reference, the formal logic that requires symbolic notation to be developed at all, and the empirical science whose evidentiary practices are inseparable from the long-form documentary record that writing alone makes possible. The same pattern is now being argued about in real time around AI, and the mechanism the critics are documenting is the one Thamus already feared. When a tool reliably performs a cognitive operation, the internal capacity for that operation tends to weaken with disuse, and the contemporary version of the worry has empirical support that goes well beyond the Phaedrus. People who know they can look up something on Google develop weaker memory for the information itself, and habitual GPS users show measurable decline in hippocampal-dependent spatial navigation. Large language models automate cognitive operations of considerably greater scope than route-finding or trivia recall, and there is no principled reason to expect the same mechanism will fail to scale. A growing body of evidence suggests that this is exactly what is happening. For a 2025 preprint, the research scientist Nataliya Kosmyna and colleagues monitored brain activity during essay writing, and found that participants who wrote with ChatGPT showed significantly weaker neural connectivity than those who wrote unassisted, with the effect persisting when those participants were later asked to write unaided. The AI researcher Hamsa Bastani and colleagues ran a field experiment with nearly 1,000 high-school students and found that those given a standard AI tutor performed 17 per cent worse on the final unaided exam than those given no AI at all, while a version of the same tool designed to scaffold learning rather than supply answers produced no such deficit. 
In a randomised trial of software engineers learning a new Python library, the AI researchers Judy Hanwen Shen and Alex Tamkin reported that those who used an AI coding assistant scored substantially lower on debugging and conceptual-understanding questions. And research by Elena Hayoung Lee and colleagues found that passive AI use, in which participants copied AI output directly, undermined their confidence in their own abilities and their sense of meaningfulness in the work, while active collaboration, in which they drafted first and refined with AI, preserved both qualities. These are serious findings, and the people who produced them are doing careful work. They are also the AI debate’s version of Harrison on anaesthesia, looking at the most comprehensive evidence base available to them while being unable to see the most important thing that will happen next. AI exhibits this legibility asymmetry in extreme form, and it is plausibly the most thoroughgoing case of it the world has yet seen. I want to turn now to my own practice, and to the question of what I have actually found after two years of working with the tools that the evidence says are degrading my cognition. One philosopher’s experience is a weak evidential basis for anything, and I am wary of leaning on mine after spending half this essay insisting that we attend to evidence with care. But the shape of the experience, if not its generalisability, bears directly on the argument. What I expected when I started was the obvious stuff – faster literature review, cleaner first drafts, less time spent on the mechanical parts of academic writing. Those gains materialised, and they turned out to be the least interesting part of the story. 
Before I used these tools, the cost of exploring a question was high, since I would notice something that seemed promising, spend several days reading around it, sketch a preliminary argument, discover a problem, and after perhaps two or three weeks arrive at a verdict on whether the question was worth pursuing. The cost of discovering that a question was a dead end was, in practice, indistinguishable from the cost of discovering that it was genuine, and you had to do most of the work before you could tell the difference. Anyone who has spent three weeks developing an argument only to discover, on a Tuesday afternoon in 2026, that someone published essentially the same idea in 2019 will recognise the particular quality of that experience, since the sunk weeks do not feel like useful learning, they feel like waste, and the knowledge that the next question might go the same way makes you less willing to start. That cost structure made me conservative about which questions I pursued and reluctant to abandon any question I had invested weeks in, which I now recognise as a textbook sunk cost bias operating on my research agenda but which at the time felt like conscientiousness. What changed was that the cost of preliminary exploration collapsed. I could sketch an argument, identify the first serious objections, test whether they were fatal, and reach a provisional verdict in an afternoon rather than a fortnight. This sounds like a simple acceleration, and the more profound effect was on what I was willing to abandon. Dropping a question after an afternoon’s work feels nothing like dropping one after three weeks. When the exploration costs are low, the sunk cost attachment disappears, and you find yourself dropping bad questions earlier and more often, which means the questions you keep are better. I explored far more ideas, and my working portfolio became both larger and better curated. 
I arrived at this outcome not through any deliberate plan but simply through sustained engagement with a tool that changed what exploration cost. The skill that improved most, and the one I would never have thought to look for, was something I can only describe as question-identification – the ability to find problems that are both tractable and important. This is the thing an academic career is substantially built on and which nobody, so far as I know, has ever tried to teach directly. I want to be honest about the costs. My ability to hold together a complex position verbally, under pressure, in a seminar or a conversation, has probably not improved and may have declined somewhat. When preliminary exploration is cheap, you spend less time grinding through arguments from first principles, a grinding that builds fluency that shows up in live exchange. Friends have pressed me on this, and they are right to worry. The shape of the disagreement is itself instructive, because the cost is immediately describable as a subtraction from a capacity I have been exercising for years, while the benefit was prospectively invisible and retrospectively obvious. I did not predict it, and I could not have named it in advance. The faster literature review and cleaner first drafts I had hoped for in 2023 turned out to be roughly as interesting as Thamus’s concession that writing would be a useful aide-mémoire. The fact that I can articulate this benefit at all is, in a sense, evidence against the strongest version of my own argument, and I want to be upfront about that. Anything I can now name has already crossed from the illegible into the legible. But nobody told me that working with LLMs would reorganise my research practice around finding good questions rather than producing answers to the ones I already had. 
There is a further wrinkle here that I find harder to set aside, which is that the language I have for describing what these tools have done to my thinking is itself partly a product of working with them, and the reader who is evaluating my testimony is doing so from a standpoint that is, in its own way, in the process of being reshaped by the same technologies. None of this excuses me from the obligation to articulate what I can, and it does mean that the asymmetry I have been describing is not something we are looking at from outside but something we are inside, including right now. This has consequences for governance. The asymmetry is not just a problem for individual users deciding how to engage with AI, since it shapes the position of every governance actor whose institutional role requires grounding decisions in documented evidence. A regulator who must demonstrate due diligence will always find documented costs more defensible than the possibility of unmeasurable benefits, because costs are evidence and possibilities are not. The same blind spot can be seen at the level of formal theory, where even careful modelling can reproduce it. In a recent working paper, Daron Acemoglu, the 2024 Nobel laureate in economics, working with Dingwen Kong and Asuman Ozdaglar, builds a mathematical model of what happens when people increasingly rely on AI for decisions that they used to work through themselves. The model captures a real and important mechanism. When people figure out things on their own, they generate knowledge as a byproduct, knowledge that flows into the shared pool that everyone subsequently draws on. AI assistance gives each individual a better immediate answer while quietly removing the cognitive labour that produces the shared pool, and the model shows that under plausible conditions this can tip into a steady state where the collective knowledge base contracts even as individual decisions remain locally rational. 
The recommended policy is to deliberately constrain AI precision, forcing people to keep doing the work themselves so that the public knowledge stock continues to be replenished. The model is elegant and the mechanism it describes is real, but the categories of knowledge it tracks are fixed from the start. There is no representation in the model of the possibility that sustained engagement with a capable system might generate qualitatively new kinds of competency that the prior categories cannot describe. The model can represent a world where what needs to be known keeps changing while being unable to represent a world where what counts as knowing changes. That is a substantive limitation, since the latter is precisely what is happening when an innovation reorganises the practices in which knowledge is constituted. The empirical evidence points in the same direction, since the best outcomes in the existing literature were produced by varying the conditions of use with a capable system, through safeguarded interfaces and thoughtfully designed engagement, rather than by degrading the system’s capabilities. The restriction case may well prove correct. The theoretical framework behind it is incomplete in a way that favours restriction, whether or not restriction is warranted. A reader who has followed the argument this far might reasonably worry that the asymmetry I have described runs in both directions, since the developmental trajectory of mature human reasoning is itself emergent, and capable, over time, of producing forms of judgment that nobody could have specified in advance. The reasoner of two centuries ago could not have anticipated the conceptual repertoire of the reasoner of today, just as the surgeon of 1846 could not have anticipated cardiac surgery, and one might therefore think that lifting the constraints under which mature reasoning has developed forecloses an illegible-benefit possibility space of its own. 
This objection is sophisticated, and I do not think it succeeds, because it conflates two different kinds of emergence. The refinements that mature reasoning undergoes in the absence of paradigm-changing technology, however unpredictable in detail, lie along a continuous trajectory whose direction is recognisable from within the existing conceptual vocabulary, and a careful reasoner of two centuries ago would have recognised the reasoner of today as occupying a more sophisticated point on the same path she was walking. Paradigm-changing technology produces emergence of a different kind, since it opens a possibility space that lies outside the trajectory of the practice it reorganises, a space whose contents cannot be specified from within the practice’s existing evaluative vocabulary, and the legibility asymmetry I have described is specifically the asymmetry that obtains in those cases. The problem is compounded by a feedback loop. Restriction forecloses the experimentation through which benefits might have become visible, so that the evidence base after restriction continues to look one-sided and the original decision appears vindicated. A university that bans AI from student writing will never discover what ways of teaching with AI might have preserved the hard thinking that actually produces learning while enabling new forms of intellectual work, and the absence of that evidence will make the ban appear wise. If we cannot estimate the benefits, we need a different test for these decisions. The one I think holds up is recoverability. The question is whether, if a decision turns out to be wrong, we can take it back. That test sounds simpler than it is. What looks recoverable in one respect may turn out not to be in another. A cognitive capacity might come back through retraining even after years of disuse. 
But the institutions that make retraining possible, the programmes and mentorship pipelines and the working communities through which the relevant skills are transmitted by doing, can dissolve on their own timetable, independent of whether anyone could in principle relearn the underlying skill. And once those institutions go, the people who knew how to run them tend to scatter, and the knowledge of how to put them back together goes with them. When the US Navy dropped celestial navigation from the curriculum after GPS made it redundant, it took roughly a decade for the vulnerability to become salient, as awareness grew that GPS could be jammed, spoofed or disabled in a first strike. The subject was reinstated in 2015, and recovery was possible because the US Merchant Marine Academy at Kings Point, New York, had continued teaching it, instructors who retained the skill still existed, and the textbooks were still there. The principle the US Navy learned, and that the aviation industry arrived at independently, is that recovery requires preserved redundancy somewhere in the system, an institutional locus where the displaced capacity continues to be cultivated, so that the trade can be reversed if it looks worse in retrospect than it did in advance. Where costs are plausibly irrecoverable, where developmental windows close or the ability to detect a problem disappears alongside the skill it depends on, the asymmetry I have described provides no reason to relax caution and may strengthen the case for restriction, since there is nothing in the evidence base that can be weighed against a loss that cannot be undone. Where costs are plausibly recoverable through feasible institutional design, restricting on the basis of visible costs alone forecloses benefits whose magnitude cannot be estimated, and the foreclosure is the greater risk precisely because the benefits foregone are those whose scale the evidence base cannot access. 
The most important assessments of these technologies will be made retrospectively, by people working with concepts we do not yet have. The question is how to govern in the meantime, when costs are legible and the benefits that matter most depend on practices we have not yet built. Uniform restriction is a bet that we already know what AI-augmented professional competence will look like, placed at a time when the evidence strongly suggests we do not. Managed experimentation, with recoverable costs contained by institutional design and the burden of proof on those who would foreclose rather than those who would explore, is a different kind of bet, and the asymmetry I have described gives us reason to prefer it. Harrison could tell you with precision what anaesthesia would cost. He could not have told you about open-heart surgery, because the practices from which such a thing could be imagined did not yet exist. Measurability and importance are not the same thing.
