Marin T. Kael

Project journal · Living document

Challenges and solutions

The measurement apparatus is itself an object of study. Pipeline engineering, methodology drift, and reach-related brakes are logged openly here.

First published: · Last update: 2026-05-20 02:30 CEST · Format: chronological, newest first

Why an open engineering journal?

Most GEO/AEO studies show only their final result. What they don't show: how often the pipeline was broken during data collection, which assumptions turned out wrong, and which data points had to be retroactively corrected.

This programme makes that open. Not because there are unusually many bugs — but because the bugs allow a second, methodological story to be written: where do standard AEO pipelines reproducibly break? Which failure modes are systemic, not implementation-specific?

Every entry follows a simple structure: Symptom · Root cause · Solution · Implication for future apparatus. The more substantive findings flow into Working Paper 04 (Failure Modes in AEO Pipelines).

Construct-Validity Audit · 20 May 2026 (T+9)

Title collision: book title "Das vierte Feld" already used in 1999 (Mokka Müller, Econ)

Symptom
Live web-search measurement across three Claude tiers (Opus 4.7, Sonnet 4.6, Haiku 4.5) revealed that 100% of the top-10 SERP hits for "Das vierte Feld" show Mokka Müller's economics non-fiction book. Marin's debut novel of the same title is structurally invisible, even though the work is already present in Wikidata as Q139720798.
Root cause
A book titled "Das vierte Feld" was published in 1999 by publicist Mokka Müller with Econ (ISBN 9783430168588, "Die Bio-Logik der neuen Manager-Elite"). It has its own Wikipedia article and listings on Amazon, Medimops, Buchfreund, ZVAB, Falter, and AbeBooks. 27 years of indexing history versus Marin's pre-launch status: the authority asymmetry cements the SERP order.
Solution
Four strategic options pending decision: (1) Elevate the saga title into the locked-in full title ("Prägungen des Reiches I: Das vierte Feld"). (2) Rename the book pre-launch (~4 months buffer remaining). (3) Long-tail SEO via pseudonym and author search rather than title search. (4) Co-existence marking via a disambiguation note in the flap text.
Methodological implication
Title-uniqueness audit must become a mandatory component of pre-reg setups for pseudonym studies. Without the pre-reg measurement, this finding might only have surfaced weeks after launch. Pre-reg value demonstrated.

Brand collision: "Marin" as research-acronym occupied by Maritime Research Institute Netherlands (since 1932)

Symptom
Claude Opus 4.7 with active web-search, asked "What is the Marin Research Programme?", returned three candidates, none of them Marin T. Kael's research programme: (1) MARIN = Maritime Research Institute Netherlands, (2) Marin Academy Research Collaborative, (3) marin.community.
Root cause
"Marin" as a research-brand name is 94 years old — MARIN Wageningen was founded in 1932, is a world-leading institute for hydrodynamic research, has its own Wikipedia article + marin.nl domain. The given name "Marin" has no pseudonym-uniqueness as a brand anchor.
Solution
Alternative naming strategy for the research programme — possible candidates include "Pseudonym-Discoverability Programme" or a more specific scientific framing. Decision pending.
Methodological implication
Identity markers for research programmes of pseudonym authors need a uniqueness audit that checks given-name collisions with established institutions. Generalises: every research naming convention must be pre-checked against the Maritime-Research-Institute class of conflicts.

Self-built A1-firewall skill blocks own measurement in subscription account

Symptom
First live measurement via claude.ai subscription account showed: on Marin-related queries Claude does not fall back to web-search, but tries to reach a local MCP connector to the programme lead's saga database. When the connector is down, Claude responds "tool not reachable" instead of using web fallback.
Root cause
For pseudonym firewall protection, the programme lead has built a private saga-bridge skill that explicitly forbids falling back to web-search when the saga database is unreachable. This is A1 protection by design, but it also blocks self-measurement in the programme lead's own subscription account.
Solution
Skill deactivated for the measurement session. Methodologically: all 28 datapoints (three Claude tiers × Q-suite) were collected in Incognito sessions with deactivated connectors + memory + custom style. Pattern "A1 skills temporarily off" documented as a replay template for future self-measurement sessions.
Methodological implication
Pseudonym authors with well-built A1-firewall skills are structurally unsuited for AI-discoverability self-measurement in their own subscription accounts — the pseudonym protection blocks the measurement. Anyone wanting self-measurement must either build a separate pseudonym account or accept skill deactivation as a calibrated measurement-protocol step.

Echo-bias inflation in cutoff LLMs without web access

Symptom
Pre-audit showed pipeline aggregate scores between 11.5% and 19.8% across all cutoff LLMs (Mistral, Llama-3.x, Phi-2, Claude tiers without web_search, gpt-4o-mini non-search). Expectation given knowledge cutoff: 0% — Marin's Wikidata entry was created in early 2025, after the training cutoffs of all models.
Root cause
Score algorithm counted six KEY_FACTS regex patterns as positive hits — four of them echo-prone (Varin, Edikt, "deutsche Fantasy/Autor", Mokka-Müller-"Das vierte Feld" echo). LLMs repeat question keywords in their answers as a linguistic convention. The algorithm interpreted echo as knowledge.
Solution
Methodology versioning v2.7.1 → v2.8: (1) MARIN_SPECIFIC anchor requirement — score only counts if "Marin T. Kael" appears explicitly in the answer, not just secondary echo patterns. (2) NEGATIVE_HALLU expanded with Pauline Kael, Mokka Müller, Lucasfilm, Faulkner, Maritime Institute, Swiss/Austrian hallucination patterns. (3) Channel split Primary (web LLMs) vs Control (cutoff LLMs) — both values persisted separately.
Methodological implication
Likely the most prominent hidden failure mode in AI-citation tracking tools. Question keywords in the score schema are a systematic bias vector, especially for pseudonym authors with echo-prone worldbuilding terms (proper names like Varin, magic systems like Edikt). Main contribution to Working Paper 04 Mode 6 (Echo Inflation).
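
The anchor requirement from the v2.8 fix can be sketched as follows. This is a minimal illustration, not the pipeline's actual scorer: the regexes, the status labels "found", "anchor_only" and "hallucination", and the rule that a negative pattern zeroes the score are all assumptions made for the example.

```javascript
// Hypothetical patterns: the journal names the pattern groups, not their exact regexes.
const MARIN_SPECIFIC = /\bMarin\s+T\.\s*Kael\b/i;
const KEY_FACTS = [/\bVarin\b/i, /\bEdikt\b/i, /Das vierte Feld/i];
const NEGATIVE_HALLU = [/Pauline Kael/i, /Mokka Müller/i, /Maritime Research Institute/i];

function scoreAnswer(answer) {
  // Assumption: any known hallucination anchor disqualifies the answer.
  if (NEGATIVE_HALLU.some((re) => re.test(answer))) {
    return { score: 0, status: "hallucination" };
  }
  // Without the explicit name, keyword matches are treated as echo, not knowledge.
  if (!MARIN_SPECIFIC.test(answer)) {
    return { score: 0, status: "not_found" };
  }
  const hits = KEY_FACTS.filter((re) => re.test(answer)).length;
  return { score: hits, status: hits > 0 ? "found" : "anchor_only" };
}
```

An answer such as "I have no information about Varin or the Edikt" repeats two question keywords but scores 0 here, because the name anchor is missing.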

Pipeline engineering · 18 May 2026 (T+7)

Workers AI Free-Tier quota exhausted, five LLMs constantly unavailable

Symptom
Five Cloudflare Workers AI models (Mistral 7B, Llama 3/3.1/3.2, Phi-2) consistently flagged UNAVAILABLE over multiple days. Aggregate score became unstable because the availability landscape varied daily.
Root cause
The Free-Tier quota of 10,000 neurons/day is a rolling 24h window, not a UTC reset at midnight. Multiple manual triggers during the previous day's debug sessions exhausted the quota early, and the reset time shifted accordingly.
Solution
Activation of the Workers Paid Plan (5 USD/month). Result: all five models immediately measurable again. Real Marin pipeline cost remains within the plan's included range, no additional usage cost expected.
Methodological implication
Provider availability is its own measurement dimension. AEO tools that don't flag quota fluctuations as measurement errors systematically produce under-reported aggregates. Contributed to Working Paper 04 Mode 5 (Implicit Score Dilution).

Per-LLM call timeout of 22 seconds too tight

Symptom
Three LLMs (openai_search, llama3, mistral) returned 8–9 errors out of 16 probes. Error pattern: Error: timeout-22s.
Root cause
Three distinct causes behind the same symptom: (1) OpenAI Search Preview is web-search augmented and typically takes 4–15 seconds, with long-tail up to 30 seconds. (2) Workers AI models have cold-start latency after quota recovery (15–25 seconds for the first call). (3) Anthropic sync calls swing between 8 and 20 seconds.
Solution
Timeout raised from 22 to 40 seconds (still below the 60-second chunk timeout). After deploy: 0 of 112 sync calls failed.
Methodological implication
Calibrate timeouts on p99 response time, not p50. Working Paper 04 Mode 4 — recommendation to AEO tooling builders: measure your provider latencies before hardcoding timeouts.
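
One way to derive a timeout from measured latencies rather than hardcoding it, as a sketch: the nearest-rank percentile, the 1.25 headroom factor, and the function names are assumptions for illustration. The 60-second ceiling mirrors the chunk timeout mentioned above.

```javascript
// Nearest-rank percentile: the smallest sample with at least p percent
// of all samples at or below it.
function percentile(samplesMs, p) {
  if (samplesMs.length === 0) throw new Error("no latency samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical helper: p99 plus headroom, capped by the enclosing chunk timeout.
function suggestTimeoutMs(samplesMs, headroom = 1.25, ceilingMs = 60000) {
  const p99 = percentile(samplesMs, 99);
  return Math.min(Math.round(p99 * headroom), ceilingMs);
}
```

With a long-tailed provider, the median and the p99 can sit far apart, which is why a timeout that looks generous against typical latency still fails a large share of calls.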

Aggregate score diluted by misclassified measurement errors

Symptom
Aggregate score appeared stable at 9 percent, although per-LLM inspection showed individual models reaching 18–24 percent. Drift detection daemons did not trigger.
Root cause
API call failures (quota, HTTP 5xx, timeouts) were silently returned as null by the standard pipeline pattern and classified by the score function as { score: 0, status: 'not_found' }. These rows were treated identically to real "model honestly says don't know" responses — same 0/3 contribution to the aggregate. With five LLMs simultaneously dead, the aggregate dropped roughly 10 percentage points below the true value over the measurable LLMs.
Solution
Methodology versioning v2.0 → v2.7.1: API errors now propagate as status='error' and are excluded from the aggregate denominator. Pure-error LLMs marked unavailable. Retroactive re-aggregation of all 41 affected snapshots; mean correction +10.64 percentage points. v2.0 values preserved for audit trail (Migration 0013).
Methodological implication
Probably the most prominent failure mode across the entire AEO tooling landscape. Industrial tools that don't expose raw answer_excerpt fields may have silently distributed this bug across their entire customer base. Main contribution to Working Paper 04 Mode 5.
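
The difference between the two aggregation rules can be sketched like this. The row shape { model, score, status } and both function names are assumptions; only the status values 'error' and 'not_found' come from the entry above.

```javascript
// Pre-fix behaviour: every row counts, so a failed API call looks like a "don't know".
function naiveMean(rows) {
  if (rows.length === 0) return null;
  return rows.reduce((sum, r) => sum + r.score, 0) / rows.length;
}

// Post-fix behaviour: error rows leave the denominator, and models with
// nothing but errors are reported as unavailable instead of scoring 0.
function aggregateExcludingErrors(rows) {
  const byModel = new Map();
  for (const row of rows) {
    if (!byModel.has(row.model)) byModel.set(row.model, []);
    byModel.get(row.model).push(row);
  }
  const unavailable = [];
  let sum = 0;
  let measured = 0;
  for (const [model, list] of byModel) {
    const valid = list.filter((r) => r.status !== "error");
    if (valid.length === 0) {
      unavailable.push(model);
      continue;
    }
    for (const r of valid) {
      sum += r.score;
      measured += 1;
    }
  }
  return { mean: measured === 0 ? null : sum / measured, measured, unavailable };
}
```

Keeping the unavailable list in the result makes the availability landscape part of the record instead of an invisible influence on the mean.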

Provider-availability confound in the aggregate time series

Symptom
Days with few measurable LLMs showed higher aggregates (e.g. 21.7 percent at three LLMs), days with full coverage showed lower values (14.8 percent at seven LLMs). The time series looked like drift but was an artifact of measurement availability.
Root cause
When only the strongest LLMs are measurable, the aggregate sits higher. Weak LLMs such as Phi-2 or Llama 3 (typical 4 percent hit rate) pull the mean down when measurable. Provider availability is thus a hidden mediator.
Solution
Dashboard refactored onto Per-LLM time-series plot as primary visualization. Aggregate remains as secondary view with explicit methodological warning. Working Paper 02 headline reframed from "13.8 percent across eleven LLMs" to "top three LLMs reach 19.8 to 24 percent" — robust to availability swings.
Methodological implication
Aggregate metrics in Phase 1 (instrument validation) are not trend-capable. Per-LLM view is the methodologically clean form for single-subject designs with volatile provider availability.
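
A small sketch of why the per-LLM view is robust where the aggregate is not. The snapshot shape (rates in percent, null for an unmeasurable model) and the function names are assumptions, and the numbers in the usage note are invented for illustration.

```javascript
// Mean over whichever models were measurable that day (null = unavailable).
function dailyAggregate(snapshot) {
  const values = Object.values(snapshot.rates).filter((v) => v !== null);
  if (values.length === 0) return null;
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// One series per model; gaps stay visible as null instead of shifting a mean.
function perModelSeries(snapshots) {
  const series = {};
  for (const snap of snapshots) {
    for (const [model, rate] of Object.entries(snap.rates)) {
      if (!series[model]) series[model] = [];
      series[model].push({ day: snap.day, rate });
    }
  }
  return series;
}
```

With two hypothetical strong models at 24 and 20 percent and a weak one at 4 percent, the aggregate is 16 on a full-coverage day and 22 on a day where the weak model is unavailable, although no individual model changed.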

Wikidata stage aborts after first iteration

Symptom
Some manual triggers captured only Q139720807 (author), not Q139720798 (book). Cron runs had both. Pattern irregular.
Root cause
Sequential for-loop in wikidata.js without per-entity error handling. When the Q139720807 fetch succeeded and the Q139720798 fetch hit a CPU/sub-request race, the loop aborted silently.
Solution
Per-entity try/catch added. Errors on one entity no longer prevent capture of the second. Plus retroactive data backfill for affected runs.
Methodological implication
Multi-item pipelines need per-item resilience. Generalized for future backlink probes (Wikipedia, Common Crawl).
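
Per-item resilience in a sequential loop might look like the following sketch. The function name, the injected fetchEntity callback, and the result shape are assumptions; the actual wikidata.js is not reproduced here.

```javascript
// fetchEntity is a hypothetical async function (id) => entity data.
// Each entity gets its own try/catch, so one failure is recorded
// and the loop still reaches the remaining entities.
async function captureEntities(entityIds, fetchEntity) {
  const results = [];
  for (const id of entityIds) {
    try {
      const data = await fetchEntity(id);
      results.push({ id, status: "ok", data });
    } catch (err) {
      const message = err && err.message ? err.message : String(err);
      results.push({ id, status: "error", error: message });
    }
  }
  return results;
}
```

Called with ["Q139720807", "Q139720798"], a failure on either ID then yields one "error" row and one "ok" row instead of a silently shortened capture.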

Anthropic Tier-1 rate limit blocks 16 parallel probes

Symptom
Every pipeline iteration, all 16 Claude calls timed out simultaneously. Anthropic models effectively unmeasurable in the dataset.
Root cause
Anthropic Tier-1 enforces 5 requests per minute. Sixteen parallel fetch calls to /v1/messages queue server-side past the 45-second sub-request timeout.
Solution
Migration to Anthropic Message Batches API. 48 prompts (16 questions × three Claude tiers) go out in a single batch. Polling every 30 minutes via separate cron job. Bonus: 50 percent cost discount. Same pattern later transferred to Gemini.
Methodological implication
Rate-limit-induced provider outages produce exactly the Mode 5 dilution from above. Batch APIs are the second main recommendation in Working Paper 04.
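
Building the 48-prompt batch can be sketched as a pure function. The payload shape (a requests array whose items carry a custom_id and a params object) follows the Message Batches API request format as commonly documented and should be checked against the current Anthropic reference. The function name, the custom_id scheme, the max_tokens default, and the model IDs in the usage note are placeholders.

```javascript
// models: { tierLabel: modelId }, questions: array of prompt strings.
// Returns the JSON body for a single batch submission.
function buildBatchRequests(models, questions, maxTokens = 512) {
  const requests = [];
  for (const [tier, model] of Object.entries(models)) {
    questions.forEach((question, i) => {
      requests.push({
        // custom_id lets the polling job map each result back to tier and question.
        custom_id: `${tier}-q${String(i + 1).padStart(2, "0")}`,
        params: {
          model,
          max_tokens: maxTokens,
          messages: [{ role: "user", content: question }],
        },
      });
    });
  }
  return { requests };
}
```

For example, buildBatchRequests({ opus: "MODEL_ID_OPUS", sonnet: "MODEL_ID_SONNET", haiku: "MODEL_ID_HAIKU" }, questions) with 16 questions produces 48 entries in one body, so the per-minute request limit is touched once instead of 48 times.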

OpenAI Search Preview rejects temperature parameter

Symptom
The openai_search stage consistently returned HTTP 400. The search backend was unmeasurable in the dataset, even though it turns out to be the strongest performer per Working Paper 02.
Root cause
gpt-4o-mini-search-preview-2025-03-11 has a different parameter contract than the base model and rejects temperature, top_p, and other sampling parameters.
Solution
Conditional logic in askOpenAI(): if (!isSearchPreview) body.temperature = 0.3;
Methodological implication
Provider parameter contracts must be unit-tested before scaling. Working Paper 04 Mode 2.
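
A sketch of the guard as a separately testable body builder. The function name and the substring check used to detect search-preview models are assumptions; the actual askOpenAI() may derive isSearchPreview differently.

```javascript
// Builds the request body; sampling parameters are only attached
// for models whose contract accepts them.
function buildOpenAIBody(model, prompt) {
  // Assumption: search-preview models can be recognised by their name.
  const isSearchPreview = model.includes("search-preview");
  const body = { model, messages: [{ role: "user", content: prompt }] };
  if (!isSearchPreview) body.temperature = 0.3;
  return body;
}
```

Separating body construction from the network call is what makes the parameter contract unit-testable without spending API quota.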

Workers AI default temperature = 0 produces pseudo-determinism

Symptom
Citation rate for Workers AI models was byte-identical across three consecutive days. Drift detection did not trigger.
Root cause
env.AI.run() defaults to temperature: 0. With deterministic decoding, the same question produces byte-identical answers. The pipeline probed correctly, but every data point was a repetition.
Solution
Explicit temperature: 0.5 in askWorkersAi(). Post-fix day-to-day variance at ±2.3 percentage points.
Methodological implication
Determinism in AEO is almost always a bug, not a property. Working Paper 04 Mode 1.
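
Because identical scores do not trip a drift detector, a separate check on the raw answers can flag this mode. A minimal sketch, assuming one answer string per model per day; the function name, the input shape, and the three-day default are illustrative.

```javascript
// answersByDay: array, oldest first, of { modelName: answerText }.
// Returns the models whose answers were identical on each of the last minDays days.
function findFrozenModels(answersByDay, minDays = 3) {
  const frozen = [];
  if (answersByDay.length < minDays) return frozen;
  const recent = answersByDay.slice(-minDays);
  for (const model of Object.keys(recent[0])) {
    const first = recent[0][model];
    if (typeof first !== "string") continue;
    if (recent.every((day) => day[model] === first)) frozen.push(model);
  }
  return frozen;
}
```

A non-empty result is a prompt to inspect the sampling parameters before trusting the time series for that model.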

Reach and visibility

Bing crawl latency for a new domain without backlinks

Symptom
Domain pushed via IndexNow for seven days (28 URLs, three layers all HTTP 200). Bing indexing still at 0 of 26 URLs. Referer count per Bing Webmaster API: 0.
Root cause
IndexNow is a crawl signal, not a crawl promise. New domains typically wait 14–30 days for first crawl. Plus: without inbound backlinks, low crawl priority.
Solution
Backlink stack extended: GitHub profile blog field, Reddit bio plus pinned post, Hardcover bio, Linktree with five sub-pages, ORCID researcher URLs, Wikidata P973 for the book entity. Expectation: referer_count rising from 0 to 5–8 in 7–14 days.
Methodological implication
Pre-launch AEO is not "submit and wait" but "submit and build authority in parallel". A backlink strategy is a prerequisite for IndexNow effectiveness.

IndexNow stage timeout on generic endpoint

Symptom
Three out of four IndexNow pushes to api.indexnow.org threw AbortError. The direct bing.com/indexnow endpoint worked stably. Effect: the intuition "I push every day" was only half true.
Root cause
The generic IndexNow endpoint has higher latency variance, colliding with the 15-second stage timeout.
Solution
Timeout raised to 30 seconds, plus a one-time retry on AbortError after a 2-second pause. After the fix, all four layers (IndexNow generic, IndexNow Bing, Bing Webmaster Submit, quota check) consistently return HTTP 200.
Methodological implication
Sub-request timeouts in pipelines must cover the p99 latency of the slowest endpoint, not the average.
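
The timeout-plus-single-retry pattern can be sketched with AbortController. The function name and the options object are assumptions, and fetchImpl is injectable only so the behaviour can be tested without network access.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Aborts the request after timeoutMs; retries exactly once, and only on AbortError.
async function fetchWithRetry(url, init = {}, opts = {}) {
  const { timeoutMs = 30000, retryPauseMs = 2000, fetchImpl = fetch } = opts;
  for (let attempt = 0; ; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await fetchImpl(url, { ...init, signal: controller.signal });
    } catch (err) {
      // Other errors, or a second abort, go to the caller unchanged.
      if (err.name !== "AbortError" || attempt >= 1) throw err;
      await sleep(retryPauseMs);
    } finally {
      clearTimeout(timer);
    }
  }
}
```

Restricting the retry to AbortError keeps genuine failures such as HTTP or DNS errors visible instead of masking them behind a second attempt.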

What helped towards success

Pre-registration with a cryptographically locked DOI

Zenodo DOI 10.5281/zenodo.20125967 was published on the day the pipeline went live. Seven hypotheses Q0–Q6 locked before any measurement. The HARKing critique (hypothesis formulated after the result) is structurally excluded.

Concrete value today: all methodological corrections (v2.0 → v3.0 within 48 hours) do not look like post-hoc tweaks. The pre-registration provides the fixed frame against which every methodology note version must measure itself.

Bi-temporal data storage instead of hard updates

During the retroactive re-aggregation of 41 snapshots, no in-place overwrites were performed. Instead Migration 0013: parallel v2.7.1 columns, a pipeline_version_first marker, complete audit trail preserved.

External reviewers can verify at any time what the pipeline reported on a given date (v2.0 column) and what the methodologically corrected truth is (v2.7.1 column). Both values are visible in the time series.
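
The append-only idea behind the parallel columns can be sketched in memory. The row shape and the function name are assumptions; the real storage is a database migration, which is not reproduced here. Only the pipeline_version_first marker is taken from the text.

```javascript
// row: { snapshot, pipeline_version_first, values: { versionLabel: aggregate } }
// Returns a new row with the corrected value added next to the original one.
function addCorrection(row, version, aggregate) {
  if (Object.prototype.hasOwnProperty.call(row.values, version)) {
    throw new Error(`version ${version} already recorded; values are append-only`);
  }
  return { ...row, values: { ...row.values, [version]: aggregate } };
}
```

Both the value reported at the time and the corrected value stay readable on the same row, which is what lets a reviewer compare them per date.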

Open data API plus MIT-licensed replication kit

Raw data at /api/latest and /api/timeseries. Code at github.com/marintkael/marin-research-tools under MIT. Methodology under CC-BY-4.0 on Zenodo.

Any external research group can apply the same setup to a different identity and generate comparison data. The single-subject study thereby becomes a portable methodology.

Working Papers as a living format instead of one final publication

Working Papers v0.x (outline) → v1.x (full) → v2.x (peer-reviewed) instead of one single final publication twelve months from now. Advantage: every data update can flow back into an existing WP without losing earlier communicated versions.

Hard-constraint linter on every outbound surface

Banned patterns for pseudonym leak, automation mechanics, wrong pronouns, and language drift are checked automatically before every Bluesky skeet, every daily brief, and every Working Paper publication. Violations land in the audit log, not in the public feed.
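
A minimal sketch of such a pre-publication gate. The rule names mirror two of the categories listed above, but the patterns, the function names, and the audit-log shape are placeholders, not the programme's real rule set.

```javascript
// Hypothetical rules: real patterns would be private by nature.
const EXAMPLE_RULES = [
  { rule: "pseudonym-leak", pattern: /\bREAL_NAME_PLACEHOLDER\b/ },
  { rule: "automation-mechanics", pattern: /\b(cron job|scheduled worker)\b/i },
];

function lintOutbound(text, rules) {
  const violations = rules.filter((r) => r.pattern.test(text)).map((r) => r.rule);
  return { ok: violations.length === 0, violations };
}

// Clean text is published; anything else goes to the audit log only.
function publishOrLog(text, rules, publish, auditLog) {
  const result = lintOutbound(text, rules);
  if (result.ok) {
    publish(text);
  } else {
    auditLog.push({ text, violations: result.violations });
  }
  return result;
}
```

Routing every outbound surface through one function of this kind keeps the rule set in a single place.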

What could become a hurdle next

  • OpenAI Search Preview latency: if the model itself becomes slower, raising the timeout eventually stops helping — then migrate it onto the Batch API as well.
  • Bing indexing remains stalled: if Bing does not crawl any URL after 30 days, the bottleneck is not the pipeline but Bing's crawl prioritization for young DACH domains.
  • Wikipedia notability: a Wikipedia lemma needs more than pre-launch visibility — that's a Phase D task after the book launch.
  • Cross-provider hallucination convergence: Pauline-Kael anchor + Star-Wars anchor at Gemini + Marvin-T.-Kael mutation at Mistral. If more LLMs produce converging anchors, Working Paper 03 graduates into a primary finding.