threat_intelligence3319 words

Don’t Let Perfect be the Enemy of Portable

AI Policy & Governance, CDT AI Governance Lab Don’t Let Perfect be the Enemy of Portable This essay first appeared on CDTer Miranda Bogen’s Substack Elicitation. You can also check out Kevin Bankston’s Substack Converger here. Last November, after over a year of regularly using ChatGPT, I was ready to give Google’s Gemini and Anthropic’s Claude a try instead. The latest versions of both models had just dropped, and everyone online was hyping them as better on a variety of benchmarks compared to OpenAI’s flagship. I’d played a bit with each of those other models before, but decided it was time to truly give them a shot, and compare their performance on my most important projects to what ChatGPT had been giving me. The problem was…I couldn’t. Data portability, from social networks to AI “Data portability” is a concept that got a lot of play in tech policy circles in the late twenty-teens, as many users had second thoughts about having their social graph locked into Facebook. Many policymakers and advocates (including me) were thinking about how to overcome that service’s seemingly insurmountable network effects and enable real competition in the social network space. People and policymakers alike wanted it to be easier to download and move not only one’s social network posts and comments but also your whole network of contacts from one service to another. That pressure led to a whole lot of new “download your data” features from the big players like Facebook and Google, as well as the codification of a right to data portability in Article 20 of the EU’s General Data Protection Regulation (GDPR). But because of differing data formats and social network service architectures, and a lack of importability features for easily moving downloaded data into other services, the dream of actually being able to transfer your digital life wholly from one social network site to another never materialized. (There is actually a corresponding obligation in GDPR Article 20 for services to enable direct transfers between services, not just downloads, but so far no regulator has made an issue of it.) The push for portability of social network data therefore came a bit too late to have very much impact on competition (although the launch of more open and decentralized networks like Mastodon and Bluesky can trace their ideological and technical roots in part to that push). And now, as social networks have moved toward feeding you “unconnected” content—that is, content not from your social connections—it’s become easier for an upstart like TikTok to draw eyeballs away from the older networks without needing to import those social connections. For the most part, then, the idea of portability has passed the social network space by—only for the issue to now rear its head in regard to the latest tech transformation, around LLMs and other generative AI services. Bridging the AI portability gap After over a year of using my AI tool of choice, ChatGPT, the service now stored hundreds of chats, and hundreds of generated documents and images, related to some of my most important and in-depth projects both personal and professional. For me to make a meaningful apples-to-apples comparison between its performance and that of the competing models, much less actually make a permanent switch, I’d need to be able to port at least a few of those projects and the dozens of threads that lived within them to competing services in order to test out their models. But there was no easy way for me to do that. The only option was to laboriously cut and paste each old chat from the old model into a new chat in the new model, and manually download and reupload all the files I’d previously uploaded to or generated in the old chats. Sure, because of the GDPR, all three services did provide downloads of your data in bulk. But there were several reasons why that was of little help in terms of actual portability, and mostly made for EU compliance theater. First: All of them offered downloads as large JSON files (JSON is JavaScript Object Notation, a lightweight, machine-readable, text-based data exchange format); Google also offered the same data in an HTML file, easy for human readers to parse. But none of them published the all-important schema for those JSON files—essentially, the digital Rosetta Stone for parsing the format of the data. And based on online commentary from independent researchers who were digging into these files, those undocumented schemas were not stable and could change on a dime, potentially reflecting ongoing changes in the company’s internal data schemas for these fast-changing products. So of course there also was not (and is not) any standardized JSON format used across the companies, which would have made it much easier to engineer corresponding importability tools for uploading the data into another service. Second, and as a result, there were no import tools! And so there was no way for me to take the file I could download from OpenAI and then upload it to Google or Anthropic (or Meta AI or X’s Grok or my open source model of choice or whatever). Portability denied! Third and finally, not least because it was hard to parse these undocumented JSONs, it wasn’t clear that they actually contained all of my relevant chat context, including cross-chat memories that the model had developed around my use. The only exception seemed to be Anthropic, which disavowed having any saved memories beyond those that were in a user-accessible text file, which was included in its own distinct JSON. Obviously, a single standard would solve this. That longer-term strategy has been the primary focus of our friends at the Data Transfer Initiative, a nonprofit consortium founded to help foster data portability, including for social networks, and is now increasingly focused on AI portability and therefore interoperability between different AI services’ data formats. DTI even found an AI company, Inflection AI, to partner with it in piloting a standardized, documented AI export format. However, adoption of that sort of joint standard seems very unlikely in the short-term, again because these undocumented JSON schemes likely reflect fast-changing internal data architectures and ever-evolving feature sets, with increasing differences between services. And so long as they’re meeting their baseline GDPR obligation to offer machine-readable downloads of your data, and absent real pressure from users and policymakers, there’s little incentive for collaboration between the labs on this issue. What is a user dreaming of portability to do? Perhaps LLMs themselves could offer the answer to this problem: they are excellent at parsing undocumented data, making even the least-structured data meaningfully machine-readable. Why couldn’t the labs build importability tools using their own LLMs to document and therefore make portable the undocumented JSON schemas used by other labs? Did we really need to let the perfect be the enemy of the good by waiting for a standard, or for widespread deployment of AI Application Program Interfaces (APIs) and Model Context Protocol (MCP) servers to create the plumbing necessary for direct portability between services? Couldn’t we turbo-charge portability by using LLMs themselves to enable quick-and-dirty importability of downloaded JSON archives without a standard and without that network plumbing? If anything, the existence of LLMs and their ability to parse unstructured data should enable a golden age of cross-application portability between all kinds of software! Indeed, I could—using a coding agent like OpenAI Codex or a desktop agent like Claude Cowork—direct a chat to look at a folder holding my downloaded JSON exports. Then the LLM could probably figure out the JSON schemas well enough to answer questions about the chats I’d had, or pull out particular chats into their own docs, at least until the thread’s context budget ran out. (I in fact tried this using Cowork on my OpenAI and Gemini archives and it worked pretty well!) But it still wouldn’t be able to do what one would expect from a true portability solution: actually reproduce all of those chats in their entirety within the importing service’s interface as separate chats. If I wanted that, I’d have to return to the unsatisfying solution of cutting and pasting each individual old chat into a new chat thread. But…perhaps the labs themselves could build an LLM-based import tool that would do that. Armed with this idea—which was admittedly a pretty obvious one once you thought about it, but which none of the companies seemed to be overtly pursuing—I began making inquiries, both on public social networks and through my private work networks. Could LLM-based parsing tools be the answer for easy AI importability without having to wait for standards, turning these lame duck JSONs into something more useful? (FWIW, when asked, ChatGPT, Gemini and Claude all agreed it was plausible, and Claude even made me a nifty interactive demo.) Screenshot from a four-step interactive demo of an imagined import feature user interface that Claude coded for me, including thread-by-thread import. If such importability features were indeed workable, were the companies prioritizing building them, and if not, why not? It seemed that Google and Anthropic in particular had a strong business reason to do so, considering OpenAI’s substantial lead in terms of user acquisition, and the fact that many of its longer-time users—like me—were probably feeling locked in by their extensive chat histories. But other than a few unofficial experiments (including a recent hackathon organized by DTI) and attempts at third-party open source tools for cross-service AI context sharing like https://onfabric.io/’s context-use utility, it didn’t seem like anyone was prioritizing it. So, I and my colleagues at the Center for Democracy & Technology’s AI Governance Lab, where we work to promote strong standards and best practices around AI safety, transparency, and accountability, decided to prioritize it ourselves. Digging into the data ourselves We at CDT wanted to know what was actually inside the user data downloads being offered by AI labs—and whether there was a real path from these undocumented JSON dumps to something a user could meaningfully use elsewhere. So, with input from our allies at the Data Transfer Initiative, I and code-savvy CDT fellow Jordan Gasior-Kavishe partnered with a software-engineer-turned-law-student, Shelby Slotter, on a spring semester research project for Professor Paul Ohm’s “Coding for Lawyers” class at Georgetown Law. The premise was straightforward: take the available data downloads from the major services, point an open-source LLM at them, and see how far you can get. Can the model reliably infer the schema from a sample? Can it document those schemas in a stable, human-and-machine-readable way so others can build on the work? Can it translate between formats, even imperfectly, into something resembling a common interchange format—perhaps along the lines of what the Data Transfer Initiative has been proposing for cross-service AI data flows? We’ll be publishing that technical research later this summer. But the headline finding, unsurprisingly, is that this is technically very doable. You don’t need OpenAI’s or Anthropic’s or Google’s cooperation to adequately parse their JSON for importability. You just need a competent LLM to crunch on a slice of the exported JSON to figure out the schema, and a willingness to keep monitoring new exports over time to identify any schema drift you’d need to adjust for. Which raises the obvious question of why well-resourced AI labs with literally the world’s most capable LLMs sitting in-house couldn’t have done the same thing months or even years ago. The answer, of course, is that they could have. They just didn’t consider it a priority, even though—again—it’s literally required by EU law. Thankfully, though, as our research was progressing during the spring semester, those priorities finally started to change. The introduction of Potemkin Portability for AI Because of its simple and legible memory format, Anthropic was the first to offer what I would call Potemkin Portability (after so-called Potemkin villages, fake facades meant to hide an unpleasant reality). In early March, Anthropic rolled out memory to all users, including its free tier, including a memory import feature. The mechanism, however, is less “import” than “translation by prompt.” Users can paste a standardized prompt into ChatGPT or Gemini that instructs the source model to spit out a structured summary of everything it knows about you, then you paste that summary into a Claude import page. Claude doesn’t read the competitor’s JSON file at all. It reads a short text blob that the competitor’s own model generated about you in response to a prompt. This is clever, and credit where due—it’s the sort of pragmatic, LLM-mediated workaround I’d been asking about. But it isn’t actually data portability. It’s the AI version of asking your doctor for your medical records and instead getting a one-page summary typed up from their memory. You get the highlights, in someone else’s voice, with whatever the source service decided was worth surfacing, and which may or may not actually be an accurate summation that emphasizes what’s most important to you. None of your actual conversations come over. None of your projects come over. None of the uploaded or generated artifacts come over. What you get is a thin description, generated by the service you’re trying to leave, of what it thinks you care about. Google soon followed suit with its own memory-import-by-prompt feature. But that feature was accompanied by another one that finally delivered some true importability. Google makes the first move on real portability…who will make the last? On March 26, Google announced that the consumer Gemini application would now accept actual chat history imports from ChatGPT and Claude: not only memory summaries (which, like with Anthropic, you could cut and paste in), but the real conversation logs themselves, up to a 5GB ZIP file. You upload the zipped JSON export you got from the other service, and Gemini ingests it. Your old conversations become searchable and queryable inside Gemini. And as far as I can tell, it works! I was able to upload my ChatGPT JSON file and copies of all of my ChatGPT chats quickly landed in my Gemini app, including both my side and ChatGPT’s side of every conversation, and with a little document icon indicating which chats in the sidebar were imported and which were Gemini originals. It’s not entirely clear yet how Google’s chat import function works under the hood. Is Gemini parsing each new JSON using a previously reverse-engineered schema they’ve banked internally? Is it doing inference-time schema discovery with the model itself, the way I had been asking about? Some combination of the two? How does Google keep the import tool from breaking the next time OpenAI or Anthropic silently changes its export format? Google hasn’t said, and I’d be curious to know. What I can say is this: it’s the first real import tool from any major lab, and it deserves credit as such. It is, finally, not Potemkin. The other labs should be absolutely racing to match it and then exceed it. But it’s also not enough. In particular: Artifacts don’t come over. Files you uploaded to ChatGPT, documents and images Claude generated for you in conversation, code interpreter outputs, charts, mockups—the actual products of all that work—don’t make the trip. For many users, including me, those are the single most valuable things in our chat history. A conversation transcript without the documents it was about is half a record. (Notably, this isn’t Google’s fault but rather OpenAI and Anthropic’s: their downloads don’t include this data. Google, on the other hand, at least includes your generated images in its download, along with an HTML version of your chats that the images plug into for human browsing.) Projects don’t come over. For anyone with a whole lot of chats, “projects” are one of the only ways to actually organize all that content. That organization disappears upon transfer. Memories don’t come over. Not the explicit memory files where the service has stored facts about you, and certainly not the implicit context—the model’s accumulated sense of how you write, what you’re working on, what you’ve already explained. Gemini’s chat history import gives you the conversations but not the understanding behind the conversations. But at least there is still the prompt-to-paste memory feature to help fill that gap. There are no thread-by-thread controls. You can’t pick and choose which threads make the trip (as in my fantasy portability demo). Nor can you say “import everything except my health-related threads,” even though both Claude and ChatGPT offer the ability to tag health-related chats for greater security. Even if a conversation was flagged as health-related in the source service, that flag is lost in the transfer to Gemini along with any special treatment. You can laboriously go prompt-by-prompt in Google’s My Activity to delete threads you no longer want (My Activity parses your activity by prompt and response, not by full chat thread, which makes this extra burdensome.) But that’s only after they have been ingested by Gemini and potentially used to train the model. In practice, that means: Google requires you to let them train on all of your imported chats from the previous service. That’s the big privacy tradeoff. If you want to import a thread, they get to train on it. If you don’t want to import a thread, well, too bad, you don’t get to pick and choose—unless you first figure out the export JSON schema and then do manual surgery on the file to remove specific chats. Considering the wide range of sensitive data that might be in your previous service’s chats, that’s a pretty raw deal for the average user. (Note that if you are using Claude’s export function, you can limit what gets exported by date—all dates vs. the last 30 days vs. the last 90 days vs. a custom date range—so you could limit what Google gets using that relatively blunt instrument. I would urge the other companies to also offer this date-limited export option.) All that said, while there are some significant gaps and caveats around what they’ve offered, Google has indeed built the real thing—actual chat portability—and the others should be embarrassed they didn’t do it first. Anthropic, in particular, has spent the entirety of its brief corporate life cultivating an image as the most responsible AI lab, and also has (until very recently) struggled to draw users away from the other AI services; getting beaten to actual portability by Google should sting. And OpenAI, as the incumbent with the most to lose from making exits easy, is predictably now the last of the three top labs to move on any portability features at all, and has the most ground to make up. What we need now is for all of them—OpenAI and Anthropic, Meta and X, and every other major service—to match Google, and then to exceed what Google has done, by preserving privacy choices at the thread level, importing artifacts and memories, and not conditioning portability on consent to training. They also need to start talking about standardizing their export approaches, a conversation that is already long overdue. To make those things happen, we need advocates and policymakers in the US and EU to keep pushing. Now’s the time to push. Don’t accept fake portability There is still time to get AI portability right in a way we didn’t manage for social networks. The technical pieces are obviously achievable as both our research and the labs’ half-hearted efforts have shown. The market case is clear: whichever challenger first offers true, complete portability from ChatGPT will give every long-time OpenAI user a reason to seriously consider switching. And the policy frameworks exist—GDPR requires real portability, while multiple state laws and federal proposals in the US may also require it—they just need to be refined, implemented, and enforced. In the meantime, we need real action from the labs themselves. Google has finally shown that one of them is at least willing to try. The rest should stop hiding behind their facades and compete.

How it works

Once you click Generate, Ollama reads this article and crafts 5 comprehension questions. Your answers are graded against the article content — general knowledge won't be enough. Score 70+ to count toward your certificate.

Questions are cached — you'll always get the same 5 for this article.