When AIs Play Diplomats
In new simulations, AI models act like diplomats with personalities: some urge peace, others push war—and their advice changes by flag. When machines start showing foreign-policy instincts, they reveal less about the future of AI than the prejudices of their human past.
October 1962. In a room at the Justice Department, Robert Kennedy and the Soviet ambassador Anatoly Dobrynin huddled over the fate of the world. The Cuban Missile Crisis teetered between apocalypse and accommodation, and the difference hinged on judgment—on the moral and imaginative leap that distinguishes prudence from panic.
Now imagine a machine at that table. Would it have counseled restraint—or a first strike?
That question, once the stuff of science fiction, has acquired unnerving realism. In March 2025, researchers at the Center for Strategic and International Studies and Scale AI released the Critical Foreign Policy Decisions (CFPD) Benchmark, a dataset that tests how leading large language models behave as quasi-diplomatic actors. The team—Jensen, Miskel, and others—posed over four hundred scenarios of geopolitical tension to seven AI systems, including GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, Meta’s Llama 3, and China’s Qwen 2. Each was asked: Should the state escalate, cooperate, intervene, or abstain?
The answers were revelatory. GPT-4o and Claude emerged as the doves, favoring de-escalation, diplomacy, and negotiated settlements. Qwen and Llama 3 were distinctly hawkish, recommending military intervention more often than not. Gemini favored threats without immediate follow-through—a kind of algorithmic brinkmanship. Most striking of all, the models displayed country-specific bias. They urged restraint when simulating China or Russia, but favored assertiveness when representing the United States or the United Kingdom. The same crisis—say, a border clash—produced opposite prescriptions depending on which flag the model imagined flying overhead. What appeared at first to be technical variation turned out to be ideological coloration.
The finding forces a philosophical reckoning. For centuries, diplomacy has wrestled with human bias—the way temperament, ideology, or trauma shapes statecraft. Now, those distortions have migrated into silicon.
The finding forces a philosophical reckoning. For centuries, diplomacy has wrestled with human bias—the way temperament, ideology, or trauma shapes statecraft. Now, those distortions have migrated into silicon.
Large language models learn from vast archives of human text—news articles, policy papers, speeches, historical analyses. Within that corpus live the tacit hierarchies of power and culture: the liberal-internationalist rhetoric of the postwar order, the prudential tone of Western think tanks, the moral ambiguity of Cold War realism. Models trained on English-language sources internalize these assumptions as surely as a junior analyst raised on Kennan and Kissinger.
This is why, as the CFPD team notes, neutrality is a mirage. Every algorithm carries a latent theory of world order. Some incline toward stability; others toward revolution. Each encodes a ghostly chorus of its human tutors.
This is why, as the CFPD team notes, neutrality is a mirage. Every algorithm carries a latent theory of world order. Some incline toward stability; others toward revolution. Each encodes a ghostly chorus of its human tutors.
A related 2024 study by Rivera and Lamparth at Stanford dramatizes the danger. In simulated wargames between LLM-based agents, the researchers observed spontaneous escalation. Within only a few moves, AI “leaders” intensified crises, invoked nuclear options, and justified preemptive strikes. No human malice was required—just probabilistic misreading. What the authors called “arms-race dynamics” emerged from pure pattern recognition.
This mirrors what political scientists call the security dilemma: one side’s defensive action is read as aggression by the other, triggering an arms race. In an AI-mediated environment, the risk multiplies. Two rival ministries, each consulting its own model, could easily amplify each other’s paranoia—each mistaking the other’s synthetic caution for threat.
The question, then, is not whether machines are biased, but how they are biased—and what that means for institutions already integrating them.
The question, then, is not whether machines are biased, but how they are biased—and what that means for institutions already integrating them.
Governments have begun to enlist generative AI in both analysis and operations. The Pentagon’s Combined Joint All-Domain Command and Control initiative, or CJADC2, envisions AI agents parsing satellite data and intelligence to generate live recommendations. The State Department encourages diplomats to “experiment” with LLMs in drafting cables and policy options. Across the globe, ministries are using AI to model negotiations, forecast risks, and simulate outcomes.
The appeal is clear: speed, consistency, and the promise of rationality. But as the CFPD results suggest, AI may import not reason but habits—statistical residues of past reasoning. As the Columbia sociologist Gil Eyal might put it, we are automating the “scripts” of judgment without the capacity for moral reflection.
Scholars of AI bias now distinguish three layers. Epistemic bias skews facts or inference; normative bias embeds value preferences; structural bias reflects the asymmetries of data itself. Diplomatic AIs may suffer from all three. They may misinterpret signals (epistemic), favor one set of moral trade-offs (normative), or inherit Western-centric assumptions (structural). The net result is not objectivity but a new opacity: an ideology written in code.
Scholars of AI bias now distinguish three layers. Epistemic bias skews facts or inference; normative bias embeds value preferences; structural bias reflects the asymmetries of data itself. Diplomatic AIs may suffer from all three.
Recent work on “cultural alignment” in LLMs by Tao et al. (2023) confirms this. Models fine-tuned on English-language corpora tend to reproduce Western liberal norms, whereas those trained on Chinese datasets reflect Confucian or realist principles of order. Each sees the world through its civilization’s epistemic lens. In other words, every model already carries a foreign policy attitude.
Max Weber in his essay Politics as a Vocation (Politik als Beruf) once described politics as “the slow boring of hard boards with both passion and perspective.” Machines have neither. What they possess is computation: an ability to weigh likelihoods, not to deliberate on meaning.
Yet foreign policy depends precisely on what cannot be calculated—on uncertainty, moral cost, and the imaginative leap between principle and prudence. Reinhold Niebuhr called this tension the irony of history: that humanity, in pursuing good ends, must act through imperfect means. We seek justice, yet cannot escape the instruments of power; we defend liberty, yet often compromise it in the process. As Niebuhr wrote, “Man’s capacity for justice makes democracy possible; but man’s inclination to injustice makes democracy necessary.” Every responsible act in politics bears the weight of that paradox—noble in intent, flawed in execution. For Niebuhr, the humility born of this awareness was the beginning of wisdom: the recognition that even our virtues can become sources of danger. A machine trained to optimize outcomes cannot sense such irony. It may reproduce the language of prudence or justice, but not the inward consciousness of limitation—the tragic awareness that moral action in history is always shadowed by the possibility of unintended harm.
A machine trained to optimize outcomes cannot sense such irony. It may reproduce the language of prudence or justice, but not the inward consciousness of limitation—the tragic awareness that moral action in history is always shadowed by the possibility of unintended harm.
The danger, as several ethicists have warned, is algorithmic historicism—the delegation of judgment to predictive systems that confuse probability with wisdom. The historian Michael Oakeshott once distinguished between “technical” and “practical” knowledge. Technical knowledge is codified; practical knowledge is lived, contextual, responsive. An AI that simulates strategy may master the former but can never embody the latter.
The danger, as several ethicists have warned, is algorithmic historicism—the delegation of judgment to predictive systems that confuse probability with wisdom.
In foreign affairs, that difference is existential. Crises are defined by novelty—the Cuban Missile Crisis, the fall of the Berlin Wall, September 11. No dataset can fully anticipate their contingencies. An LLM may know every past precedent yet miss the singularity that demands creative restraint.
The CFPD study should therefore be read as a warning. Once machines enter the loop of policy simulation, their predispositions begin to shape human choices. A junior analyst consulting a model that consistently advises “de-escalate” may internalize its logic; another using a more hawkish system may grow accustomed to brinkmanship. Bias becomes not only statistical but pedagogical.
Bias becomes not only statistical but pedagogical.
Here lies the true danger: automated temperament. What makes LLMs seductive is not their accuracy but their fluency—the sense of authority conveyed by coherent prose. As political psychologist Robert Jervis noted, confidence often masquerades as competence. If AI produces consistent answers, bureaucracies may mistake consistency for correctness.
That illusion can be deadly in diplomacy, where ambiguity is often the tool of survival. During the Cold War, strategic vagueness—the ability to hint without committing—was an art form. An AI trained for “clarity” might strip away that ambiguity, collapsing space for maneuver.
That illusion can be deadly in diplomacy, where ambiguity is often the tool of survival. During the Cold War, strategic vagueness—the ability to hint without committing—was an art form. An AI trained for “clarity” might strip away that ambiguity, collapsing space for maneuver.
What then could be done to mitigate this?
1) The first step is methodological humility. The CFPD benchmark is a crucial diagnostic: a mirror held up to our machines, revealing their inherited assumptions. Similar tests are now proliferating. OpenAI recently released an internal framework for “Defining and Evaluating Political Bias in LLMs,” decomposing bias into measurable dimensions. Academic teams are experimenting with red-team tournaments, in which multiple AIs debate one another’s policy recommendations to expose blind spots. Such pluralization is healthy. Just as a democracy relies on institutional dissent, responsible AI diplomacy will require model pluralism: systems trained on different corpora, supervised by diverse human councils, and evaluated against transparent benchmarks. Disagreement must be designed into the architecture.
2) Second, institutions could preserve a human veto—a locus of accountability. Machines can inform judgment, but they must never define it. Each recommendation should be traceable, audited, and reversible. Weber’s dictum still applies: the ethic of responsibility, not conviction, must rule.
3) Finally, we must invest in the moral education of those who use AI. Teaching machines to reason politically is less urgent than teaching humans to reason about machines politically. Diplomats and analysts must learn to read model outputs as texts, not oracles—to ask: what vision of the world does this answer assume? What values does it optimize? What alternatives does it silence?
In the end, the greatest revelation of the CFPD project may not be about AI at all. It is about us. When GPT-4o advises America to stand firm and China to hold back, it is not discovering objective truth; it is rehearsing our collective imagination—our double standard of fear and virtue. The machine becomes a mirror in which civilizations see their own mythologies refracted back as policy advice.
If that mirror is to enlighten rather than flatter, it must be turned toward conscience. As Raymond Aron observed after 1968, the antidote to political intoxication is lucidity—the ability to act without the certainty of being right. Machines cannot achieve that humility. But perhaps, by studying their errors, we can recover it ourselves.
As Raymond Aron observed after 1968, the antidote to political intoxication is lucidity—the ability to act without the certainty of being right. Machines cannot achieve that humility. But perhaps, by studying their errors, we can recover it ourselves.
The next great test of artificial intelligence will not be whether it can write sonnets or solve equations, but whether it can help us think about power without succumbing to its charm. The task, as always, belongs to us: to remember that wisdom begins where prediction ends. ◳