RISEI Lab Research · Innovation · Elevation Est. 2021 Northwestern University $50M+ Lifetime Funding 170+ Countries Evanston, Illinois
NBER Working Paper No. 35110 · RISEI + Policy Brief

How (un)Stable Are LLM Occupational Exposure Scores?

Every major forecast about which jobs AI will eliminate comes from asking AI to rate itself. We found that the answer depends heavily on which AI you ask. If the measurement is unstable, the policies built on it are too.

Michelle Yin, Hoa Vu, & Claudia Persico

NBER Working Paper No. 35110 · April 2026
Northwestern University & American University · NBER
JEL: C81, J23, J24, O33

19×: variation in the high-exposure share across 4 AI raters on identical data (2.7% under Gemini to 51.5% under Claude)
3.6×: variation in the mean direct-exposure score across raters (14% under Gemini to 51% under Claude)
4.6×: variation in the share of management occupations flagged "high-exposure" (18% under Gemini, 83% under Claude)
4: frontier AI raters tested (GPT-4, GPT-5, Gemini 2.5, Claude 4.5)

Key Findings

The Headline Statistic Varies 19-Fold

We replicated the dominant rubric (Eloundou et al., 2024) with four frontier AI raters (GPT-4, GPT-5, Gemini 2.5, Claude 4.5) on identical O*NET task data. The flagship Eloundou statistic, the share of US occupations with more than half of tasks at high direct exposure, ranges from 2.7% under Gemini 2.5 to 51.5% under Claude 4.5, a 19-fold spread. Mean direct exposure varies 3.6-fold (14% to 51%). Same data, same procedure, four very different lists of who is most at risk.
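As a rough illustration of how a statistic of this kind is built, the sketch below computes the share of occupations in which more than half of tasks are rated high exposure, then the fold spread between raters. The occupations, task labels, rater names, and function names are hypothetical toy values, not the paper's data or code. Only the reported percentages (2.7%, 51.5%, 14%, 51%) come from the text above.

```python
from collections import defaultdict


def high_exposure_share(task_labels):
    """Share of occupations where more than half of tasks are rated high exposure.

    task_labels: list of (occupation, is_high_exposure) pairs, one pair per task.
    """
    totals = defaultdict(int)
    highs = defaultdict(int)
    for occupation, is_high in task_labels:
        totals[occupation] += 1
        highs[occupation] += int(is_high)
    flagged = [occ for occ in totals if highs[occ] / totals[occ] > 0.5]
    return len(flagged) / len(totals)


def fold_spread(values):
    """Ratio of the largest to the smallest value across raters."""
    values = list(values)
    return max(values) / min(values)


# Hypothetical toy labels for two made-up raters (NOT the paper's data).
toy_labels = {
    "rater_a": [
        ("editors", True), ("editors", True), ("editors", False),
        ("roofers", False), ("roofers", False),
        ("managers", True), ("managers", False),
    ],
    "rater_b": [
        ("editors", True), ("editors", True), ("editors", True),
        ("roofers", False), ("roofers", False),
        ("managers", True), ("managers", True),
    ],
}

toy_shares = {rater: high_exposure_share(labels) for rater, labels in toy_labels.items()}
toy_fold = fold_spread(toy_shares.values())

# Fold spreads implied by the percentages reported in the text.
reported_share_fold = fold_spread([2.7, 51.5])  # high-exposure share
reported_mean_fold = fold_spread([14.0, 51.0])  # mean direct exposure

print(toy_shares, toy_fold)
print(round(reported_share_fold, 1), round(reported_mean_fold, 1))
```

Because the threshold is a strict "more than half", an occupation with exactly half of its tasks flagged (managers under rater_a in the toy data) does not count as high-exposure. Small rating differences near that cutoff can therefore move an occupation in or out of the headline share.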

Direction Is Consistent, Workers Identified Are Not

In standard difference-in-differences employment regressions, all four raters produce negative coefficients (high-exposure occupations grow more slowly post-2022). But the set of occupations driving that result shifts sharply across raters. Under Claude, 83% of management occupations are high-exposure; under Gemini, only 18% are. Two analyses with different raters arrive at the same broad conclusion through different sets of workers, with very different policy implications.
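To make the point concrete, here is a minimal two-by-two difference-in-differences sketch on made-up numbers. It is not the paper's regression specification, which this summary does not spell out. The occupations, growth figures, and rater flags are all hypothetical. It only shows how two raters can both yield a negative estimate while flagging entirely different occupations.

```python
from statistics import mean

# Hypothetical employment growth in percent: occupation -> (pre-2022, post-2022).
growth = {
    "managers": (2.0, 1.0),
    "analysts": (2.0, 0.5),
    "clerks": (1.5, 0.0),
    "roofers": (1.0, 1.5),
    "nurses": (2.0, 2.5),
}

# Hypothetical high-exposure flags from two made-up raters.
flags = {
    "rater_a": {"managers", "analysts"},
    "rater_b": {"clerks"},
}


def did_estimate(growth, treated):
    """(post - pre) for flagged occupations minus (post - pre) for the rest."""

    def mean_change(occupations):
        pre = mean(growth[occ][0] for occ in occupations)
        post = mean(growth[occ][1] for occ in occupations)
        return post - pre

    control = set(growth) - set(treated)
    return mean_change(treated) - mean_change(control)


estimates = {rater: did_estimate(growth, treated) for rater, treated in flags.items()}
for rater, estimate in estimates.items():
    print(rater, sorted(flags[rater]), round(estimate, 3))
```

In this toy setup both estimates are negative, yet the two raters share no flagged occupation, so a policy targeted at "high-exposure workers" would reach different people depending on the rater.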

Adoption Drives Capability Measurement

Occupations with higher observed AI usage show significantly larger increases in measured exposure across model generations (coefficient = 0.335, p < 0.05). The measurement instrument evolves with adoption, creating a feedback loop that systematically underrepresents communities with lower AI access.

Global Policy Built on Narrow Data

Only 16.3% of the global population has ever used generative AI. The BLS, OECD, ILO, IMF, and WEF all use these scores for employment projections affecting billions. We are making policy for the whole world based on data from 1 in 6 people.

Figure 3: E1 Exposure by Detailed Occupation and Annotator

The figure orders 95 three-digit SOC occupations by cross-model disagreement. The largest spreads concentrate in supervisory roles and in occupations that combine cognitive and physical tasks, precisely where the boundary between LLM-capable and non-LLM-capable work is most ambiguous. Spreads reach as much as 77 percentage points on identical tasks, solely because a different model assigned the scores.
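An ordering like this can be sketched as follows, assuming that disagreement is measured as the gap between the highest and lowest rater score for each occupation (the summary does not state the exact definition). The occupations, rater names, and scores are hypothetical.

```python
def rank_by_disagreement(scores):
    """Sort occupations by spread (max minus min exposure across raters), largest first.

    scores: {occupation: {rater: exposure in percent}}
    """
    spreads = {
        occupation: max(by_rater.values()) - min(by_rater.values())
        for occupation, by_rater in scores.items()
    }
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)


# Hypothetical exposure scores in percent (NOT the paper's data).
toy_scores = {
    "first-line supervisors": {"rater_a": 10.0, "rater_b": 35.0, "rater_c": 80.0},
    "roofers": {"rater_a": 2.0, "rater_b": 4.0, "rater_c": 5.0},
    "technical writers": {"rater_a": 70.0, "rater_b": 85.0, "rater_c": 90.0},
}

ranking = rank_by_disagreement(toy_scores)
for occupation, spread in ranking:
    print(f"{occupation}: {spread:.0f} percentage points")
```

A range-based spread is sensitive to a single outlying rater. With only four raters that is arguably the point, since any one of them could be the rater a given study happens to use.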

Figure 3: E1 exposure by detailed occupation and annotator, showing 95 three-digit SOC occupations with up to 77 percentage point spreads across ChatGPT-4, ChatGPT-5, Gemini 2.5, and Claude 4.5
How to cite this figure:
Yin, M., Vu, H., & Persico, C. (2026). Figure 3: E1 exposure by detailed occupation and annotator. In How (un)stable are LLM occupational exposure scores? Evidence from multi-model replication (NBER Working Paper No. 35110, p. 32). National Bureau of Economic Research.

https://riseilab.org/pub-ai-measurement.html#figure3

Read the Paper & Policy Brief

Working Paper — Yin, Vu & Persico (2026)
Policy Brief No. 2026-03 — Adaptive Precision Framework

The Adaptive Precision Framework

If AI capabilities are a moving target, adoption varies enormously, and the measurement instruments are circular, what is the alternative? The companion policy brief proposes Adaptive Precision: using AI-enabled real-time data to continuously recalibrate what we teach, how we hire, how we design jobs, and how we deliver services.

Personalized Learning

Curricula adjusted each semester based on sector-specific adoption data.

Personalized Hiring

Assessment rubrics recalibrated as new populations adopt AI tools.

Personalized Job Design

Task bundles restructured continuously as AI capabilities evolve.

Personalized Services

Workforce development tailored to each community's adoption landscape.


Press & External Coverage