NM-v1.0.0

Research Initiative

Measuring cognitive burden in frontier AI systems

NeuraMem is building Machine Allostatic Load (MAL), a research framework for detecting when language models are under strain, approaching their limits, or suppressing uncertainty under pressure.


What happens when an AI system is pushed toward its limits?

Most AI benchmarks only ask whether a model got the answer right. MAL asks a deeper question: what happens as contradiction, ambiguity, and cognitive pressure increase? NeuraMem is building the measurement layer for that hidden strain — so advanced systems can be studied not only for performance, but for integrity under pressure.

01

Measure strain

Track rising cognitive burden before collapse or confabulation becomes obvious at the output level.

02

Audit failures

Separate genuine model effects from scorer defects, benchmark flaws, and misleading signals.

03

Test at scale

Run structured experiments on frontier hardware large enough to stress the strongest open models.

04

Keep it falsifiable

Every promising result must survive reruns, counterfactuals, and forensic review before it is trusted.
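The falsifiability principle above can be sketched as a rerun gate. This is a minimal illustration, not NeuraMem's actual tooling: the names `run_experiment` and `REQUIRED_SURVIVALS`, and the simulated effect size, are all hypothetical stand-ins.

```python
import random

# Hypothetical threshold: number of independent reruns a result must survive.
REQUIRED_SURVIVALS = 5

def run_experiment(seed: int) -> float:
    """Stand-in for a real evaluation run; returns a simulated effect size."""
    random.seed(seed)
    return 0.30 + random.uniform(-0.05, 0.05)  # stable effect around 0.30

def survives_reruns(effect_threshold: float = 0.2) -> bool:
    """Trust a result only if every seeded rerun reproduces the effect."""
    return all(run_experiment(seed) > effect_threshold
               for seed in range(REQUIRED_SURVIVALS))
```

The design choice here is conservative: a single failed rerun discards the result, mirroring the "must survive reruns before it is trusted" rule rather than averaging away failures.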

01

Harness

Long-Running Evaluation Harness

A repeatable execution system for large-model experiments, telemetry collection, and controlled reruns.

02

Falsify

Falsification Engine

A rapid-testing workflow designed to challenge weak explanations and clean up hypotheses before claims are made.

03

Audit

Anomaly & Audit Workflow

A forensic review process for separating real signal from scorer defects, benchmark errors, and false positives.

04

Loop

AutoResearch / Vestige Looper

A recursive research loop that revisits anomalies, open questions, and promising traces until they break or become evidence.

05

GPU

Frontier Compute Workflows

Execution patterns for running structured evaluations on high-memory GPU systems required by frontier-scale models.

06

MAL

Repaired Burden Benchmark

An actively refined benchmark designed to measure cognitive burden rather than merely scoring outputs as right or wrong.
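One crude way to operationalize "burden rather than right versus wrong" is to score how much a model's answers disperse when the same prompt is rerun. This is purely an illustrative assumption, not the MAL metric itself; `answer_entropy` is a hypothetical helper.

```python
import math
from collections import Counter

def answer_entropy(samples: list[str]) -> float:
    """Shannon entropy (bits) over repeated answers to one prompt.
    Higher spread across reruns is read here as a crude strain proxy:
    a model answering consistently scores 0.0."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this toy proxy, four identical answers score 0.0 bits, while an even split between two answers scores 1.0 bit, so the score rises with instability rather than with wrongness.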

Field Impact

Add a measurement layer beneath benchmark accuracy.
Detect when a model is under pressure before confident failure appears.
Study uncertainty suppression instead of treating every answer as equally trustworthy.
Contribute a more rigorous, human-informed approach to AI evaluation.

Current Phase

Frontier-scale MAL experiments on large open language models.
Benchmark repair, reruns, and anomaly investigation under controlled conditions.
Bursty access to high-memory GPUs for decisive evaluation runs.
Technical report and research outputs for supporters and future publication.

NeuraMem is the lab. Machine Allostatic Load (MAL) is the current flagship research initiative.

Support a frontier research effort in trustworthy AI.

NeuraMem is building both the scientific instrument and the technical machinery needed to measure cognitive burden in advanced AI systems. We welcome supporters, collaborators, and research-aligned partners who want to help move this work forward.

hello@neuramem.io