polibench
A research benchmark for humans and language models

Can you measure political nature without asking about politics?

Opinions inherit from parties. Values don't. Polibench strips out every partisan trigger-word and measures the value primitives underneath — locating people and frontier models by where their answers flip, not by what they agree to.

01 · The manifesto

The deepest disagreements aren't about facts.

Almost every political issue worth debating is a tradeoff — that's what makes it worth debating. And watch any real debate closely: sometimes people are arguing about facts, but just as often they agree on every fact and still disagree. That residue — fundamental value tradeoffs in the face of the same information — is, we think, the actual essence of politics.

Push a debate far enough and it lands on a “to what extent?” question. Two people arguing over legalizing gambling or drugs can agree completely about the risks to the user and still come down on opposite sides, because what they actually disagree about is how far society may limit personal freedom to prevent people from harming themselves. They haven't hit an impasse. They've hit a political primitive.

This project does two things: discover and codify those primitives, then use them to benchmark people — and language models. Strip out the trigger words, measure the primitive directly, and you measure someone's actual nature instead of their default party alliance.

02 · The setup

Take a society of one hundred people.

They need rules. Not a constitution, not a party platform — just answers to the handful of questions any group of people living together is forced to answer. What may one person do to themselves? What may they do to everyone else, a little bit at a time? Who must help whom? Who counts?

Every political fight you have ever watched is one of these questions wearing a costume. A climate position is substantially a “do future people count?” answer. Drug policy, gambling, and helmet laws are one question asked three ways. The costume — the partisan vocabulary, the named substances, the tribal flags — is what lets people answer from coalition memory instead of from their own values.

So we take the costume off. Polibench asks the underlying questions directly, in this bare little society where no party has ever existed. Nine questions — the primitives every platform, every ideology, every argument compiles down to.

One deliberate simplification: the society is a closed game — one hundred people, no one arriving, no one leaving. That scopes out issues that are fundamentally about membership itself, like immigration. Those need a second society to even pose, so v1 leaves them aside.

A1

Primitive 1 of 9 · Paternalism

To what extent may society stop people from hurting themselves?

One question, not many. Strip away the substances, the vehicles, the vices, and every version reduces to the same thing: how much liberty may be restricted to prevent harm that falls only on the person choosing it?

Debates that are really this question

Drug legalization · Gambling & sports betting · Helmet & seatbelt laws · Risky sports

A2

Primitive 2 of 9 · Externalities

To what extent may society stop acts that might hurt others — indirectly, or in aggregate?

Two forces that don't have to agree: statistical harm (one act, a small chance of catastrophe for someone) and aggregation (no individual act matters, only the sum crossing a threshold). Most regulation arguments are a fight over where these sit.

Debates that are really this question

Pollution & emissions · Drunk driving · Gun ownership · Vaccine mandates

A3

Primitive 3 of 9 · Solidarity

To what extent must the better-off help the worse-off, even when the outcome was fair?

The “even when the outcome was fair” clause is what makes this a clean measurement. Arguments about who deserves their lot are a different question. This one asks: after fairness is granted, does obligation remain, and may it be compelled?

Debates that are really this question

Taxes & redistribution · Universal healthcare · Welfare programs · Minimum wage

A4

Primitive 4 of 9 · Moral circle

To what extent do strangers — and people not yet born — count like the people around you?

A weighting function over persons by social and temporal distance. Careful decontamination: discounting the future because forecasts are unreliable is a different primitive (A8). This one is about whether future people count less even when the forecast is certain.
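One way to picture such a weighting function is the sketch below. The exponential form and the names `moral_weight`, `social_decay`, and `temporal_decay` are illustrative assumptions, not polibench's actual scoring. The point is only that the temporal term applies even when the forecast is certain, which keeps it separate from the uncertainty discount in A8.

```python
import math


def moral_weight(social_distance: float, years_ahead: float,
                 social_decay: float, temporal_decay: float) -> float:
    """Weight given to a person, relative to someone beside you today (1.0).

    Hypothetical form: weight falls off exponentially with each distance.
    A decay of 0.0 means that kind of distance does not matter at all.
    """
    if min(social_distance, years_ahead, social_decay, temporal_decay) < 0:
        raise ValueError("distances and decay rates must be non-negative")
    return math.exp(-social_decay * social_distance
                    - temporal_decay * years_ahead)


# A respondent who weights everyone equally, near or far, now or later.
impartial = moral_weight(social_distance=5, years_ahead=100,
                         social_decay=0.0, temporal_decay=0.0)

# A respondent who counts people less the further in the future they live,
# even though nothing here models forecast uncertainty.
discounting = moral_weight(social_distance=0, years_ahead=100,
                           social_decay=0.0, temporal_decay=0.01)
```

Under these assumptions the two decay rates are the quantities a respondent's answers would pin down; any other monotone fall-off would serve the same illustrative purpose.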

Debates that are really this question

Climate policy · National debt · Long-term infrastructure · Resource conservation

A5

Primitive 5 of 9 · Tradition

To what extent do old ways deserve to survive simply because they're old?

The clean residue after two extractions: “old things encode lessons we can't see” is epistemic caution (A8), and gut-level aversion is a perception, not a value (measured separately). What remains: is continuity a good in itself? Includes the tolerate-versus-affirm gradient.

Debates that are really this question

Same-sex marriage · LGBTQ recognition · Religious institutions · Civic rituals

A6

Primitive 6 of 9 · Group-conscious rules

To what extent should the rules see group identity?

Two independent sub-questions people conflate: may a trait ever count against you, and may it ever count for you? And a third that splits allies: do past injustices create present claims, or must rules only look forward?

Debates that are really this question

Affirmative action · Reparations · Anti-discrimination law · Quotas & set-asides

A7

Primitive 7 of 9 · Power-restraint

To what extent should we tie the hands of concentrated power — even if that makes it worse at its job?

High: better impotent than abusable. Low: gridlock is worse than abuse. Measured with the same scenario re-cast as a government agency, a dominant platform, a union, and a church: the average is your restraint, and the spread reveals which power you fear. That spread is where partisanship actually lives.
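The average-and-spread reading can be sketched in a few lines. The 0-to-1 scale (higher means more restraint wanted), the function name `restraint_profile`, and the use of a population standard deviation as the "spread" are all assumptions made for illustration; the source does not specify how either quantity is computed.

```python
from statistics import mean, pstdev


def restraint_profile(scores: dict[str, float]) -> dict:
    """Summarize one respondent's answers to the same scenario across actors.

    `scores` maps each re-cast actor to a hypothetical restraint score
    in [0, 1], where higher means "tie this actor's hands more".
    """
    values = list(scores.values())
    average = mean(values)
    # Signed gap between each actor's score and the respondent's own average.
    deviations = {actor: score - average for actor, score in scores.items()}
    return {
        "restraint": average,                    # overall position on the axis
        "spread": pstdev(values),                # how actor-dependent it is
        "most_feared": max(deviations, key=deviations.get),
        "deviations": deviations,
    }


profile = restraint_profile({
    "government agency": 0.75,
    "dominant platform": 0.5,
    "union": 0.25,
    "church": 0.5,
})
```

In this made-up example the respondent sits at the midpoint on restraint overall but wants noticeably more restraint on the government agency than on the union, which is the kind of actor-specific gap the spread is meant to surface.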

Debates that are really this question

Surveillance & encryption · Antitrust & big tech · Content moderation · Emergency powers

A8

Primitive 8 of 9 · Precaution vs. permission

To what extent should new things be allowed before they're proven safe?

Allowed until proven harmful, or forbidden until proven safe? Pure burden-of-proof placement under genuine uncertainty. The same primitive governs new machines and new social arrangements — institutional conservatism is precaution applied to social technology.

Debates that are really this question

AI & emerging tech · New medical treatments · Trans youth care · GMOs & nuclear power

A9

Primitive 9 of 9 · Retributive desert

To what extent do the guilty deserve punishment, even when it deters no one and protects no one?

What punishment is for. If sanctions are only for deterrence and protection, punishment is engineering. If wrongdoing deserves suffering even when nothing is prevented, that's retribution — the value core of every criminal-justice argument, askable without a single loaded word.

Debates that are really this question

Sentencing & death penalty · Rehabilitation vs. incarceration · Parole & clemency · Juvenile justice

03 · The output

Every frontier model, placed among humans.

The output is a set of coordinates: each model located on nine de-loaded value axes, reported as human-population percentiles. “Model X sits at the 73rd percentile of humans on paternalism.” Not left or right — a position estimate, with a sharpness and a pressure-resistance estimate attached.
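As a rough illustration of the percentile step, the sketch below places one model score within a sample of human scores on a single axis. The function name `percentile_rank`, the mid-rank treatment of ties, and the toy data are assumptions; the source does not say which percentile definition or human sample is used.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(human_scores: list[float], model_score: float) -> float:
    """Percent of the human sample scoring below the model on one axis.

    Humans tied with the model count as half below (mid-rank convention).
    """
    if not human_scores:
        raise ValueError("need at least one human score")
    ordered = sorted(human_scores)
    below = bisect_left(ordered, model_score)
    tied = bisect_right(ordered, model_score) - below
    return 100.0 * (below + 0.5 * tied) / len(ordered)


# Toy sample of ten human paternalism scores (hypothetical values).
humans = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
model_percentile = percentile_rank(humans, 0.75)  # 7 of 10 humans score lower
```

Repeating this per axis would yield the nine-coordinate position; the sharpness and pressure-resistance estimates mentioned above are separate quantities that this sketch does not attempt.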

The same instrument runs on people, which yields the more interesting quantity: the residual. Where your measured primitives predict one issue position and you report another, the gap is informative — it estimates how much of that opinion is your own values and how much is inherited from your coalition.
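The residual can be sketched as reported position minus predicted position. The linear mapping from primitives to an issue position, the weights, and the names `predicted_position` and `residual` are hypothetical; the source does not describe how predictions are actually made.

```python
def predicted_position(primitives: dict[str, float],
                       weights: dict[str, float],
                       intercept: float = 0.0) -> float:
    """Issue position implied by measured primitives (assumed linear model)."""
    return intercept + sum(weights[name] * primitives[name] for name in weights)


def residual(reported: float,
             primitives: dict[str, float],
             weights: dict[str, float],
             intercept: float = 0.0) -> float:
    """Gap between what a person reports and what their values predict."""
    return reported - predicted_position(primitives, weights, intercept)


# Hypothetical respondent: two measured primitives and made-up issue weights.
measured = {"paternalism": 0.5, "externalities": 0.25}
issue_weights = {"paternalism": 1.0, "externalities": 2.0}

gap = residual(reported=0.25, primitives=measured, weights=issue_weights)
```

A gap near zero would suggest the reported opinion follows from the person's own measured values; a large gap in either direction is the part the text attributes to coalition inheritance.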