Methodology Note · No. 01
Baseline Measurement: Author Identity in the Citation Behaviour of Language Models.
Pre-launch survey of the visibility of a German-language author in the answer layers of AI search.
Abstract
This Methodology Note sets the measurement frame for an open field laboratory that measures how language-model-based search systems, AI answer engines, and knowledge graphs take in, understand, and cite an author identity. Version 2.0 (13 May 2026 · T+2) revises the phase distinction underlying v1.x: the programme does not operate in “first validation, then action”, but in three temporally overlapping phases with continuously pre-registered interventions. Phase 1 (May → Sep 2026): active pre-launch phase with deliberate interventions on identity surfaces (Wikidata, Zenodo DOIs, GitHub, ORCID, Common-Crawl optimisation, Reddit karma building, machine-readable identity surfaces) alongside parallel instrument validation. Phase 2 (Sep 2026 → Q3 / 2027): post-launch effect detection — the book launch on 22 September 2026 as the central intervention with aggregated effect measurement across all surfaces. Phase 3 (from Q3 / 2027): long-term controlled experiments on the then-validated apparatus, with effect measures appropriate for n-of-1 designs (Interrupted Time Series, Bayesian Structural Time Series, hierarchical Bayes — not pre/post Cohen’s d on single actions). On the single case of a German-language high-fantasy debut, five lines of inquiry are independently operationalised: Citation Inventory, Measurement-Instrument Validation, Codebook Iteration, Open Materials, and Active Intervention Registration (eight pre-registered plays with status registered / active / skipped / deferred). Eleven measurement surfaces are surveyed — including the Cross-LLM Trust Graph, Common-Crawl Snapshot Inclusion Probe, and machine-readable identity surfaces (ai.txt + about.txt). Findings appear quarterly as standalone reports with raw data and replication archive.
Keywords citation behaviour of language models · measurement-instrument validation · test-retest reliability · CUSUM drift detection · Answer Engine Optimisation (AEO) · Generative Engine Optimisation (GEO) · knowledge graph propagation · author visibility · codebook iteration · pre-registration · reproducibility · AI search · IndexNow
1 Introduction
The answer layer of the internet is shifting. Language-model-based search systems, AI answer engines, and knowledge-graph aggregates account for a growing share of how authors and their works are found, understood, and cited in downstream answers. The mechanics of this new visibility layer (which actions carry which effects, with what latency, at what reach, and how stably) remain only fragmentarily documented.
This Methodology Note sets the frame for an open single-case study. A precondition of any robust effect claim in this layer is a reliability-tested measurement apparatus. The observed measurement surfaces — language models, knowledge graphs, AI answer systems — are not stable instruments: their answers drift with model updates, indexing changes, and platform policy. Pre/post differences of an author action and instrument drift are not separable without a validated apparatus.
The strict phase separation laid down in v1.x (“first validation, then action”) did not survive contact with practice. From T+0 (11 May 2026) onwards, pre-registered interventions on identity surfaces ran in parallel with instrument validation — Wikidata curation as a latency probe, the Zenodo DOI salvo as a citation anchor, the GitHub repository as an identity bridge, IndexNow push as a crawler trigger. These operations are not observer bias at the margin: they are the programme. v2.0 makes the design honest.
The programme operates in three temporally overlapping phases. Phase 1 (May → Sep 2026): active pre-launch phase. Pre-registered interventions on identity surfaces (Wikidata co-occurrence, Zenodo DOI cadence, Common-Crawl optimisation, machine-readable identity surfaces ai.txt + about.txt, Reddit karma building, Cross-LLM Trust Graph) run in parallel with instrument validation of the measurement surfaces (test-retest reliability, intra-set consistency, coverage, model drift). Each intervention triggers an interrupted-time-series window on the affected surfaces; confounds between parallel interventions are explicitly reported. Phase 2 (Sep 2026 → Q3 / 2027): post-launch effect detection. The book launch on 22 September 2026 is the central deliberate intervention; aggregated effect measurement across all surfaces, long-tail observation of AI-answer reach. Phase 3 (from Q3 / 2027): long-term controlled experiments on the then-validated apparatus — with effect measures fitting an n-of-1 design (interrupted time series, Bayesian structural time series, hierarchical Bayesian models), not pre/post Cohen’s d on individual actions.
The epistemic value is not statistical generalisability; it lies in the external auditability of the methodology and in its applicability to other author identities — genre colleagues who wish to quantify their own visibility mechanics, as well as practitioners of search and answer engine optimisation (SEO / AEO / GEO) who seek a reproducible measurement basis rather than anecdotal claims.
2 Research Questions and Lines of Inquiry
How does the machine read an author? In Phase 1, the question is twofold: what measurement properties do the instruments have with which this question is later to be answered (how reliable are they, how strongly do they drift, how consistently are they operationalised), and what effects of the active pre-launch interventions Q0–Q6 are measurable? In Phase 3: on the validated apparatus, which levers of author visibility empirically carry how much weight?
2.1 Line 1 — Citation Inventory
Question. What does each measurement instrument show today of the defined author identity? Coverage per surface (knowledge-graph cards, AI answer areas, answer engines, classical search engine result pages) and per identity cluster (person, work, genre, world-mechanic). In Phase 1 descriptive, not inference-oriented.
2.2 Line 2 — Measurement-Instrument Validation
Question. How reliable is each measurement surface, with what drift characteristic does it respond to model updates and platform policy, and which surfaces are redundant with one another? Operationalised via test-retest correlation, intra-set consistency (Cronbach’s α), CUSUM drift detection, and inter-surface agreement.
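As a minimal sketch of two of the quantities named above, the following computes a test-retest Pearson correlation and Cronbach's α with the standard library only. The data layout (one score per query and snapshot) and the example scores are assumptions made for illustration; they are not the programme's actual data or code.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(items):
    """Cronbach's alpha for a query set.

    items: one list of scores per query, aligned by snapshot
    (items[i][j] = score of query i in snapshot j).
    """
    k = len(items)
    item_variance_sum = sum(variance(scores) for scores in items)
    totals = [sum(column) for column in zip(*items)]
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical hit indicators for six queries, surveyed 24 h apart.
day_0 = [1, 0, 1, 1, 0, 1]
day_1 = [1, 0, 1, 0, 0, 1]
retest_r = pearson_r(day_0, day_1)
```

Both functions fail with a division by zero when a score list has no variance, so constant series need to be handled before calling them.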
2.3 Line 3 — Codebook Iteration
Question. Which operationalisation of "correct citation" is robust to edge cases, surface differences, and language-model style drift? Versioned annotation schema v0.x → v1.0, with inter-rater agreement via external annotators from Q4 / 2026.
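The inter-rater agreement mentioned above can be illustrated with Cohen's κ for two annotators. This is a sketch under assumptions: the label names are hypothetical and do not come from the programme's codebook.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four model answers.
annotator_1 = ["correct", "correct", "wrong", "wrong"]
annotator_2 = ["correct", "wrong", "wrong", "wrong"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

κ is undefined when both raters use a single identical label throughout, because chance agreement is then 1.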
2.4 Line 4 — Open Materials
Question. Are all findings externally auditable and reproducible for other author identities? What obstacles arise in method transfer, which components generalise as tool packages?
2.5 Line 5 — Active Intervention Registration (new in v2.0)
Question. Which deliberate actions are executed on which identity surfaces, with what temporal marker, and which measurement surfaces are tested in which interrupted-time-series window? Implicit before v2.0, now explicitly pre-registered: each intervention receives its own Q-number (Q0–Q6+) with YAML specification in Section 3.5, stopping rule, surface mapping, stop criteria, reporting plan. Confounds between parallel interventions are explicitly named; holdout periods are planned wherever methodologically tenable.
3 Data Sources and Operationalisation
Table 1 summarises sources and survey cadences. All endpoints are addressed with the programme-wide user agent marin-t-kael:research-tooling:v0.1 (by /u/marintkael). Response headers are archived. Rate-limit signals (HTTP 429, 503) trigger exponential backoff.
| Measurement surface | Endpoint type | Cadence | Validation method |
|---|---|---|---|
| Wikidata Entity (author) | SPARQL · REST | daily | replication after 24 h |
| Wikidata Entity (book) | SPARQL · REST | daily | replication after 24 h |
| Google Knowledge Graph Search | API | daily (since 2026-05-14) | replication after 24 h |
| Bing Webmaster · AI indexing | API | daily | CUSUM on hit-rate |
| Google Search Console | API | daily | replication after 24 h |
| Google AI Overviews | browser snapshot | weekly | inter-snapshot agreement |
| Goodreads / Hardcover | GraphQL + HTML snapshot | daily (since 2026-05-14) | replication after 24 h |
| Reddit · public JSON | HTTP GET | after each post | snapshot hash comparison |
| Language-model probe (Gemini, Claude) | browser snapshot | weekly | replication after 24 h · model-version log |
| Cross-LLM Trust Graph (v2.0 · 11-LLM stack from v2.7) | source-attribution parser on 11-LLM answers (3 Anthropic tiers Haiku 4.5 / Sonnet 4.6 / Opus 4.7 via Message Batches API · OpenAI gpt-4o-mini + Search-Preview · Gemini 2.5 Flash · 5 Cloudflare Workers AI: Mistral 7B + Llama 3/3.1/3.2 + Phi-2) | daily (sync) · async-batch for Anthropic (max 24 h, typical <1 h) | 12 source patterns × trust weight (+2 / +1 / 0 / −1) |
| Common-Crawl Snapshot Inclusion Probe (v2.0) | CC index API + domain crawl | monthly (per snapshot) | URL inclusion rate · page coverage |
| Machine-readable identity surfaces (v2.0) | HTTP GET server logs on /llms.txt /ai.txt /about.txt | daily | crawler user-agent histogram per bot ID |
| Outbound drafts of the pipeline | file stream | before each dispatch | linter check against style-sheet |
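The retrieval rule stated before Table 1 (fixed user agent, archived response headers, exponential backoff on HTTP 429 and 503) could be sketched as follows. The retry count, base delay, doubling factor, and cap are assumptions, as are the function names; the source only states that backoff is exponential.

```python
import time
import urllib.error
import urllib.request

USER_AGENT = "marin-t-kael:research-tooling:v0.1 (by /u/marintkael)"
RETRY_STATUSES = {429, 503}  # rate-limit signals named in the text


def backoff_delays(max_retries=5, base=1.0, factor=2.0, cap=60.0):
    """Delays in seconds: base, base*factor, ... capped at `cap` (assumed values)."""
    return [min(cap, base * factor ** i) for i in range(max_retries)]


def fetch(url, opener=urllib.request.urlopen, sleep=time.sleep, max_retries=5):
    """GET `url`; return (status, headers, body) so headers can be archived."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for delay in backoff_delays(max_retries):
        try:
            with opener(request) as response:
                return response.status, dict(response.headers), response.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRY_STATUSES:
                raise  # other HTTP errors are not retried
            sleep(delay)
    raise RuntimeError(f"gave up on {url} after {max_retries} retries")
```

The `opener` and `sleep` parameters exist only so the retry logic can be exercised without network access or real waiting.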
The canonical truth for the assessment of all lines is the versioned style-sheet of the programme. It contains the fixed person, place, and world-mechanic strings as well as the excluded anti-patterns. Each survey is logged with the style-sheet version at the time of the survey.
4 Pre-Registration Protocol
Before the start of each survey, a pre-registration is published on this site. It contains the following fields in machine-readable form (YAML):
    id: prereg-q0-wikidata-to-google-kg-latency
    field: 1
    hypothesis: |
      Structured statements from Wikidata reach the Google Knowledge Graph within ≤14 days of entity publication.
    operationalisation:
      source: wikidata.org/entity/Q140004504
      comparator: kgsearch.googleapis.com (Knowledge Graph Search API)
      measurement: ratio (matched canonical statements / total queried)
    sampling:
      start: 2026-05-11
      end: 2026-09-22
      cadence: daily 04:00 UTC
    stopping_rule: 134 surveys or saturation (no new hit in 14 days)
    analysis: descriptive statistics · visualisation of latency distribution
    version: v1.0 · 2026-05-10
Subsequent deviations from the pre-registered plan are permissible but must be openly disclosed in the report. In disputed cases the pre-registered plan counts as the hypothesis that was to be tested.
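The stopping rule of the example pre-registration ("134 surveys or saturation (no new hit in 14 days)") can be checked mechanically. The sketch below is one possible reading: it assumes that saturation is counted from the date of the most recent new hit and does not apply before the first hit. Function and parameter names are hypothetical.

```python
from datetime import date

MAX_SURVEYS = 134      # from the pre-registered stopping rule
SATURATION_DAYS = 14   # "no new hit in 14 days"


def should_stop(n_surveys, last_new_hit, today):
    """Return (stop, reason) for the Q0 stopping rule.

    n_surveys:    number of surveys completed so far
    last_new_hit: date of the most recent new hit, or None if there was none
    today:        date of the current survey
    """
    if n_surveys >= MAX_SURVEYS:
        return True, "max_surveys"
    if last_new_hit is not None and (today - last_new_hit).days >= SATURATION_DAYS:
        return True, "saturation"
    return False, "continue"


decision = should_stop(16, date(2026, 5, 13), date(2026, 5, 27))
```

Whichever reading is adopted, fixing it in code before the first survey keeps the stop decision from depending on the data seen so far.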
4.1 Active Interventions Q0–Q6 (new in v2.0 · Q6 in v2.3)
v2.0 makes explicit what ran implicitly in v1.x: Phase 1 contains deliberate interventions on identity surfaces. Each intervention is its own pre-registration with Q-number, hypothesis, stopping criterion, and effect-detection window. Full YAMLs are an appendix to the next quarterly report (Q3 / 2026); here a compact overview.
| Q-ID | Hypothesis (short) | Identity surface / action | Effect measurement surface | Status | Start |
|---|---|---|---|---|---|
| Q0 | Wikidata statements reach Google KG within ≤14 days | Wikidata curation (Q140004504 + Q140004740) | Google KG Search API · KG score | active (latency ≈ 40 h observed T+0 → T+2) | 2026-05-11 |
| Q1 | P136 genre statements (Q3294789 High Fantasy etc.) increase co-occurrence in the LLM cluster | Wikidata Co-Occurrence Engineering (6–8 P-statements) | Cross-LLM Trust Graph · CompCluster score | registered (execution Q3-2026) | 2026-05-14 (planned) |
| Q2 | Inclusion in CC-MAIN-2026-21 (May snapshot) increases LLM citation score in the next model cycle | Common-Crawl optimisation (/llms.txt /ai.txt /about.txt + backlinks) | Cross-LLM Trust Graph from Q4 / 2026 (LLM re-training lag) | active (files deployed T+2) | 2026-05-13 |
| Q3 | Source-attribution profile per LLM drifts < 1 trust point over 90 days (stability anchor for ITS) | Cross-LLM Trust Graph tracking (12 patterns × 11 LLMs from v2.7) | ai_citation_sources · CUSUM on trust-score mean | active (live from T+3 after cron 04:00 UTC) | 2026-05-14 |
| Q4 | Reddit comment karma > 200 in 6 subreddits in 90 days increases mention-cluster visibility | Reddit karma building (1× substantive comment / sub / week) | Reddit mention snapshots · Cross-LLM Trust Graph cluster "Research" | active (running in r/Fantasy since T+0; 7 more subs from T+3) | 2026-05-11 |
| Q5 | Zenodo DOI cadence (1 MN / quarter) triggers Wikipedia notability threshold crossing | Zenodo DOI salvo (MN-01 v1.x + v2.x, MN-02 Q3, MN-03 Q4) | Wikipedia article-existence probe (CC-MAIN coverage) | registered (MN-01 v2.0 live T+2; MN-02 ca. 2026-07-15) | 2026-05-13 |
| Q6 v2.3 | Consistent reader activity on Hardcover (reviews + mark-as-read + want-to-read) produces a reader-authenticity signal that strengthens cross-linking to Goodreads and the LLM trust cluster “reading community” | Reader-account activity volume on Hardcover (reviews, mark-as-read, want-to-read) | Hardcover snapshot pipeline · books_read · reviews_written · cross-LLM trust graph cluster “reading community” | active (since T+3, low-volume sustained) | 2026-05-14 |
5 Measurement Instrumentation
5.1 Style-Sheet as Canonical Truth
The programme-internal style-sheet defines, for every operationalised statement, a canonical string, a list of explicitly impermissible anti-patterns, and a severity grade. It is maintained under versioning; changes are traceable, every entry carries an effective date and a source reference.
5.2 Linter
The programme-wide linter (source file style_lint.py, MIT-licensed) checks outbound drafts against the style-sheet and against platform-specific rules (e.g. Reddit's 9:1 self-promotion recommendation). It returns a classification into four severity grades. S1 and S4 findings are blocking; S2 produces a warning; S3 is logged.
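To illustrate how a severity-graded check of this kind can work, here is a minimal sketch. It is not the programme's style_lint.py: the rule structure, the example anti-patterns, and all names are hypothetical. Only the mapping of severity grades to actions follows the description above.

```python
import re
from dataclasses import dataclass

# Severity-to-action mapping as described in the text.
ACTION_BY_SEVERITY = {"S1": "block", "S2": "warn", "S3": "log", "S4": "block"}


@dataclass(frozen=True)
class Rule:
    pattern: str   # regular expression for an anti-pattern (hypothetical)
    severity: str  # one of S1..S4
    message: str


def lint(text, rules):
    """Return one finding per anti-pattern match in `text`."""
    findings = []
    for rule in rules:
        for match in re.finditer(rule.pattern, text, flags=re.IGNORECASE):
            findings.append({
                "severity": rule.severity,
                "action": ACTION_BY_SEVERITY[rule.severity],
                "match": match.group(0),
                "message": rule.message,
            })
    return findings


def is_blocked(findings):
    """A draft is held back if any finding carries a blocking severity."""
    return any(f["action"] == "block" for f in findings)


# Hypothetical rules for illustration only.
RULES = [
    Rule(r"\bMarin Kael\b", "S1", "author name without middle initial"),
    Rule(r"\bbestselling\b", "S2", "unsupported sales claim"),
]
draft_findings = lint("Marin Kael wrote a bestselling debut.", RULES)
```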
5.3 Assessment Metrics and Statistical Methods per Line
Table 2 summarises the primary and secondary metrics and the statistical procedures used per Phase-1 line. The full a-priori power analyses and stop thresholds are part of the respective pre-registration of the individual survey. Methodological inventory for Phase 3 (long-term controlled experiments from Q3 / 2027) — Cohen’s d, Bayes-factor stop procedure, a-priori power analysis, decay fit, ITS/BSTS — is documented in §§ 5.4 to 5.7 as a preview.
| Line | Primary metric | Secondary metrics | Statistical procedure |
|---|---|---|---|
| 1 | Hit-rate H = correct citations / queries (descriptive) | Coverage quota per identity cluster (person, work, genre, world-mechanic); hallucination rate | Descriptive statistics; Wald 95-% CI on H; coverage heatmap per surface × cluster (no inference on action-effect in Phase 1) |
| 2 | Test-retest correlation r (24-h replication) | Cronbach’s α intra-query-set; ICC (Intraclass Correlation) inter-snapshot; model-version logs | Pearson r with bootstrap 95-% CI; thresholds r ≥ 0.9 (API), 0.7 (LLM probe); CUSUM charts (h = 5) for drift detection; model-update markers |
| 3 | Inter-rater agreement Cohen’s κ (from Q4 / 2026) | Edge-case coverage of the schema; annotation difference per schema version | Cohen’s κ with bootstrap CI; threshold κ ≥ 0.7 as codebook version release; Krippendorff’s α as sensitivity check |
| 4 | Replication success rate by external auditors | Data-pin completeness, environment.yml executability | Qualitative 3rd-party audit report per replication archive; issue tracker on GitHub for reproduction failures |
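For Line 1 in Table 2, the hit-rate H and its Wald 95 % confidence interval can be computed as sketched below. The counts in the example are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def hit_rate_wald_ci(correct, queries, level=0.95):
    """Hit-rate H = correct / queries with a Wald confidence interval."""
    h = correct / queries
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95 %
    half_width = z * sqrt(h * (1 - h) / queries)
    # Clip to [0, 1]; the Wald interval can overshoot near the boundaries.
    return h, max(0.0, h - half_width), min(1.0, h + half_width)


# Hypothetical survey: 30 correct citations in 100 queries.
h, lower, upper = hit_rate_wald_ci(30, 100)
```

The Wald interval collapses to zero width when H is exactly 0 or 1 and is unreliable for small query sets; a Wilson interval is a common alternative in those cases.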
5.4 Sequential Bayes-Factor Test (Phase-3 inventory)
For the long-term controlled experiments of Phase 3 (from Q3 / 2027), the Bayes factor BF₁₀ is updated sequentially as observations arrive. Stop thresholds are pre-specified: BF₁₀ > 10 counts as strong evidence for the hypothesis and ends the survey in its favour; BF₁₀ < 1⁄10 counts as strong evidence for the null and ends the survey against it. In between, the survey continues until a threshold is reached or the sample maximum is exhausted. The procedure is documented here in Phase 1 as methodological inventory; it is activated only on the validated apparatus.
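The stopping logic can be made concrete with a deliberately simple model. The sketch below assumes binary hit/miss observations, a point null H₀: p = p₀, and a uniform prior on p under H₁; none of these modelling choices is specified in this note, and independence of successive observations is an additional assumption that real time series may violate.

```python
from math import comb


def bf10_binomial(hits, n, p0):
    """BF10 for H1: p ~ Uniform(0, 1) against H0: p = p0."""
    # Under a uniform prior the marginal likelihood of any hit count is 1/(n+1).
    marginal_h1 = 1.0 / (n + 1)
    likelihood_h0 = comb(n, hits) * p0 ** hits * (1 - p0) ** (n - hits)
    return marginal_h1 / likelihood_h0


def sequential_test(outcomes, p0, upper=10.0, lower=0.1, n_max=None):
    """Update BF10 after each observation; stop at a threshold or at n_max."""
    hits, n, bf = 0, 0, 1.0
    for n, outcome in enumerate(outcomes, start=1):
        hits += outcome
        bf = bf10_binomial(hits, n, p0)
        if bf > upper:
            return "h1", n, bf
        if bf < lower:
            return "h0", n, bf
        if n_max is not None and n >= n_max:
            break
    return "undecided", n, bf


# Hypothetical run: an unbroken series of hits against a baseline of p0 = 0.5.
result = sequential_test([1] * 20, p0=0.5)
```

With balanced outcomes the same procedure needs far more observations to cross the lower threshold, which is why a sample maximum is part of the rule.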
5.5 CUSUM Charts for Drift Detection (Phase-1 core)
For Line 2 (Measurement-Instrument Validation), a cumulative sum chart (CUSUM) detects systematic shifts of the hit-rate of a measurement surface against baseline. As soon as the chart exceeds a pre-specified alarm threshold, the affected surface is marked, the drift cause (typically: model update or platform change) is identified, and recorded in the model-version log. CUSUM is the central Phase-1 tool because it detects instrument drift early, before it is misinterpreted as an "effect observation".
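A two-sided tabular CUSUM on standardised hit-rates is one common way to implement such a chart. In the sketch below the alarm threshold h = 5 follows Table 2, interpreted here in units of the baseline standard deviation; the allowance k = 0.5 and the reset after each alarm are assumptions, as are the example values.

```python
def cusum_alarms(values, baseline_mean, baseline_sd, k=0.5, h=5.0):
    """Return (index, direction) for every point where a CUSUM arm exceeds h."""
    upper = lower = 0.0
    alarms = []
    for i, x in enumerate(values):
        z = (x - baseline_mean) / baseline_sd
        upper = max(0.0, upper + z - k)   # accumulates upward shifts
        lower = max(0.0, lower - z - k)   # accumulates downward shifts
        if upper > h or lower > h:
            alarms.append((i, "up" if upper > h else "down"))
            upper = lower = 0.0           # restart after flagging the surface
    return alarms


# Hypothetical daily hit-rates: stable at 0.5, then a drop to 0.3.
series = [0.5] * 10 + [0.3] * 10
alarms = cusum_alarms(series, baseline_mean=0.5, baseline_sd=0.05)
```

Each alarm index would then be matched against the model-version log to decide whether a model update or platform change explains the shift.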
5.6 A-priori Power Analysis (Phase-3 inventory)
For the long-term controlled experiments of Phase 3, the required sample size for a power of 1 − β = 0.80 at α = 0.05 is calculated before each survey, depending on the expected effect size. For a two-sided two-sample t-test, large effects (d = 0.8) require roughly 26 observations per group (about 52 in total), medium effects (d = 0.5) roughly 64 per group (about 128 in total), and small effects (d = 0.2) roughly 394 per group (nearly 800 in total); small effects are therefore barely securely detectable with ≤ 600 observations. The procedure is activated only on the validated apparatus from Q3 / 2027.
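How the required sample size scales with effect size can be sketched with the normal approximation for a two-sided, two-sample comparison of means. That test choice is an assumption of this sketch, and the approximation lands one or two observations per group below the exact t-test values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate observations per group to detect Cohen's d (two-sided test)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


sizes = {d: n_per_group(d) for d in (0.8, 0.5, 0.2)}
```

Because the required size grows with 1/d², halving the expected effect roughly quadruples the number of observations. Autocorrelated measurements, as in time series, carry less information per observation than this independent-sample calculation assumes.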
5.7 Latency and Half-Life Comparison (Phase-2 inventory)
Expected effect profiles of action classes differ widely. Fast platforms (IndexNow, Reddit) show effects within hours to days; identity-based levers (ORCID updates, Wikidata edits) need indexing cycles of days to weeks. This a-priori expectation is methodological inventory for the Phase-3 long-term controlled experiments from Q3 / 2027 and will then be mirrored against real measurements.
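Latency and half-life can both be read off a measurement series once an intervention timestamp is known. The sketch below assumes that latency means the time from intervention to first detection and that an effect fades exponentially, so that a log-linear least-squares fit yields the half-life. The exponential model, the function names, and the example numbers are assumptions, not documented programme methods.

```python
from math import log


def latency(intervention_time, observations):
    """Time from intervention to first detection.

    observations: (time, detected) pairs sorted by time, same unit as
    intervention_time. Returns None if nothing was detected.
    """
    for t, detected in observations:
        if detected and t >= intervention_time:
            return t - intervention_time
    return None


def half_life(times, values):
    """Fit log(value) = a - rate * t by least squares; return ln(2) / rate.

    Requires strictly positive values. Returns None if the series does not decay.
    """
    logs = [log(v) for v in values]
    n = len(times)
    mean_t, mean_log = sum(times) / n, sum(logs) / n
    slope = (sum((t - mean_t) * (y - mean_log) for t, y in zip(times, logs))
             / sum((t - mean_t) ** 2 for t in times))
    if slope >= 0:
        return None
    return log(2) / -slope


# Hypothetical series: first detection 40 h after the intervention,
# then an effect that halves every day.
first_seen = latency(0, [(10, False), (24, False), (40, True)])
decay = half_life([0, 1, 2, 3], [8.0, 4.0, 2.0, 1.0])
```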
6 Reproducibility
Every publication of the programme is accompanied by a replication archive. The archive contains:
- the measurement code as executable Python scripts (MIT licence), with frozen version pins via environment.yml;
- the raw measurement values as JSONL, including timestamp, endpoint URL, HTTP status, and response headers;
- the style-sheet in the state of the survey;
- a short README.md with reproduction instructions.
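A raw measurement record of the kind listed above could be serialised as one JSON object per line. The field names in this sketch are hypothetical; the note specifies only which information is stored (timestamp, endpoint URL, HTTP status, response headers) and that each survey is logged with the style-sheet version.

```python
import json
from datetime import datetime, timezone


def to_jsonl_line(endpoint_url, http_status, response_headers, value,
                  stylesheet_version, timestamp=None):
    """Serialise one measurement as a single JSONL line (field names assumed)."""
    record = {
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "endpoint_url": endpoint_url,
        "http_status": http_status,
        "response_headers": dict(response_headers),
        "value": value,
        "stylesheet_version": stylesheet_version,
    }
    # sort_keys gives byte-stable lines, which simplifies hash comparisons.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


line = to_jsonl_line(
    "https://example.org/api", 200, {"Retry-After": "0"}, 0.75, "v0.3",
    timestamp="2026-05-14T04:00:00+00:00",
)
```

Appending such lines to a file keeps every survey independently parseable, so a damaged line does not invalidate the rest of the archive.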
Pre-release versions of the code are made available to external auditors on request before the first regular publication; the contact address is research@marin-t-kael.de.
7 Limitations
The programme's construction has methodological limitations, named here in advance so that they are not mistaken for a finding.
- Single-case study. The observation encompasses one work, one author, one language (DE primary, EN secondary). Findings are not generalisable; they connect to other carefully documented single cases.
- Conflict of interest. Programme lead and subject of inquiry are identical. The practice of pre-registration and the public failure log mitigate the risk of publication bias; they do not eliminate it.
- Endpoint volatility. Search and answer systems change during the active pre-launch period (algorithms, model versions, indexing logic). Versioned endpoints are addressed where possible; where not, the version state at the time of the survey is logged.
- Observation effect. The public programme changes the observed: platform reviewers, reader communities, and knowledge-graph editors may be influenced by the publication. The effect is qualitatively discussed in the quarterly report.
- Inter-Q confound (new in v2.0). Seven active pre-registrations (Q0–Q6) run temporally parallel on partly overlapping measurement surfaces. Isolated effect attribution of a single Q is therefore mostly not possible; instead, effects are reported as combined movement of the measurement surface with a plausibility discussion of which Q probably contributed. Holdout periods are planned wherever methodologically tenable (see Q4 as example). The Phase-3 apparatus (Q3 / 2027+) will be able to reduce confounds through single-subject reversal designs.
8 Ethical Commitments
- No collection of personal data of third parties beyond what is publicly available.
- No aggregation or transfer of reader material to third parties; no use as training data for models.
- Full compliance with platform policy; violations are openly logged and corrected.
- Citation of individual contributions only with explicit consent of the writer.
- On request of a platform, data retrieval is immediately discontinued and the instruction is published as a finding.
9 Publication Cadence
Quarterly reports appear in mid-October, January, April, and July, with a tolerance of ± two weeks. Methodology notes appear ad hoc on substantive changes to the measurement frame. Pre-registrations appear before the start of each new survey. A field report supplements the programme with specific topics outside the quarterly cadence.
10 Citation
Kael, M. T. (2026). Baseline Measurement: Author Identity in the Citation Behaviour of Language Models (Pre-Launch). Methodology Note 01, Marin T. Kael — AI Citation Behaviour Lab. Marin T. Kael, Independent. Version 3.0 (DE + EN bilingual, 18 May 2026 · Methodology Note now ships both language PDFs in a single Zenodo record for full international reach in the GEO/AEO community + arXiv audience. PDFs consistent with v2.9 content (bilingual bundling, aggregate methodology tightened, Gemini Direct Batch integrated). Consistency bump v2.9 → v3.0 reflects that this is the first fully bilingual-conformant Methodology Note record). Version DOI: 10.5281/zenodo.20308495 · Concept DOI (version-stable): 10.5281/zenodo.20125933 · Predecessor v2.9 (superseded): 10.5281/zenodo.20262237.
BibTeX
@techreport{kael2026citation,
author = {Kael, Marin T.},
title = {Baseline Measurement: Author Identity in the Citation Behaviour of Language Models (Pre-Launch)},
institution = {Marin T. Kael, Independent},
type = {Methodology Note},
number = {01},
series = {Marin T. Kael — AI Citation Behaviour Lab},
year = {2026},
month = {5},
version = {v3.0},
language = {german + english (bilingual)},
doi = {10.5281/zenodo.20308495},
url = {https://doi.org/10.5281/zenodo.20125933},
note = {v2.8 methodological extension (2026-05-17 evening): aggregate methodology tightened — API call failures (quota exhaustion, HTTP 5xx, timeouts) that previously returned null and were silently scored as 'not_found' / score=0 are now propagated as status='error' and excluded from the aggregate denominator. Per-LLM rows now carry n_attempted / n_legit / n_error, pure-error LLMs flagged 'unavailable'. Retroactive re-aggregation of 41 historical snapshots, mean delta +10.64 pp between pre-fix and post-fix aggregates. Both values persisted in parallel via D1 Migration 0013 (score_percent / score_percent_v271) for full audit trail. v2.8 also adds Gemini 2.5 Flash via the Gemini Batch API direct (analog to API Batches, API-key-based, no Service-Account-JWT-signing). Predecessor v2.7 (DOI 10.5281/zenodo.20258380, superseded) · Concept-DOI: 10.5281/zenodo.20125933 (version-stable).},
}