Marin T. Kael

Methodology Note · No. 01

Baseline Measurement: Author Identity in the Citation Behaviour of Language Models.

Pre-launch survey of the visibility of a German-language author in the answer layers of AI search.

Abstract

This Methodology Note sets the measurement frame for an open field laboratory that measures how language-model-based search systems, AI answer engines, and knowledge graphs take in, understand, and cite an author identity. Version 2.0 (13 May 2026 · T+2) revises the phase distinction underlying v1.x: the programme does not operate in “first validation, then action”, but in three temporally overlapping phases with continuously pre-registered interventions. Phase 1 (May → Sep 2026): active pre-launch phase with deliberate interventions on identity surfaces (Wikidata, Zenodo DOIs, GitHub, ORCID, Common-Crawl optimisation, Reddit karma building, machine-readable identity surfaces) alongside parallel instrument validation. Phase 2 (Sep 2026 → Q3 / 2027): post-launch effect detection — the book launch on 22 September 2026 as the central intervention with aggregated effect measurement across all surfaces. Phase 3 (from Q3 / 2027): long-term controlled experiments on the then-validated apparatus, with effect measures appropriate for n-of-1 designs (Interrupted Time Series, Bayesian Structural Time Series, hierarchical Bayes — not pre/post Cohen’s d on single actions). On the single case of a German-language high-fantasy debut, five lines of inquiry are independently operationalised: Citation Inventory, Measurement-Instrument Validation, Codebook Iteration, Open Materials, and Active Intervention Registration (eight pre-registered plays with status registered / active / skipped / deferred). Eleven measurement surfaces are surveyed — including the Cross-LLM Trust Graph, Common-Crawl Snapshot Inclusion Probe, and machine-readable identity surfaces (ai.txt + about.txt). Findings appear quarterly as standalone reports with raw data and replication archive.

Keywords: citation behaviour of language models · measurement-instrument validation · test-retest reliability · CUSUM drift detection · Answer Engine Optimisation (AEO) · Generative Engine Optimisation (GEO) · knowledge graph propagation · author visibility · codebook iteration · pre-registration · reproducibility · AI search · IndexNow

1  Introduction

The answer layer of the internet is shifting. Language-model-based search systems, AI answer engines, and knowledge-graph aggregates account for a growing share of how authors and their works are found, understood, and cited in downstream answers. The mechanics of this new visibility layer (which actions carry which effects, with what latency, at what reach, and how stably) have so far been documented empirically only in fragments.

This Methodology Note sets the frame for an open single-case study. A precondition of any robust effect claim in this layer is a reliability-tested measurement apparatus. The observed measurement surfaces — language models, knowledge graphs, AI answer systems — are not stable instruments: their answers drift with model updates, indexing changes, and platform policy. Pre/post differences of an author action and instrument drift are not separable without a validated apparatus.

The strict phase separation laid down in v1.x (“first validation, then action”) did not survive contact with practice. From T+0 (11 May 2026) onwards, pre-registered interventions on identity surfaces ran in parallel with instrument validation — Wikidata curation as a latency probe, the Zenodo DOI salvo as a citation anchor, the GitHub repository as an identity bridge, IndexNow push as a crawler trigger. These operations are not observer bias at the margin: they are the programme. v2.0 makes the design honest.

The programme operates in three temporally overlapping phases. Phase 1 (May → Sep 2026): active pre-launch phase. Pre-registered interventions on identity surfaces (Wikidata co-occurrence, Zenodo DOI cadence, Common-Crawl optimisation, machine-readable identity surfaces ai.txt + about.txt, Reddit karma building, Cross-LLM Trust Graph) run in parallel with instrument validation of the measurement surfaces (test-retest reliability, intra-set consistency, coverage, model drift). Each intervention triggers an interrupted-time-series window on the affected surfaces; confounds between parallel interventions are explicitly reported. Phase 2 (Sep 2026 → Q3 / 2027): post-launch effect detection. The book launch on 22 September 2026 is the central deliberate intervention; aggregated effect measurement across all surfaces, long-tail observation of AI-answer reach. Phase 3 (from Q3 / 2027): long-term controlled experiments on the then-validated apparatus — with effect measures fitting an n-of-1 design (interrupted time series, Bayesian structural time series, hierarchical Bayesian models), not pre/post Cohen’s d on individual actions.

The epistemic value is not statistical generalisability; it lies in the external auditability of the methodology and in its applicability to other author identities — genre colleagues who wish to quantify their own visibility mechanics, as well as practitioners of search and answer engine optimisation (SEO / AEO / GEO) who seek a reproducible measurement basis rather than anecdotal claims.

2  Research Questions and Lines of Inquiry

How does the machine read an author? In Phase 1 the question is twofold. First, what are the measurement properties of the instruments that will later answer it: how reliable are they, how much do they drift, and how consistently are they operationalised? Second, what effects of the active pre-launch interventions Q0–Q6 are measurable? In Phase 3, on the validated apparatus: which levers of author visibility empirically carry how much weight?

2.1  Line 1 — Citation Inventory

Question. What does each measurement instrument show today of the defined author identity? Coverage is recorded per surface (knowledge-graph cards, AI answer areas, answer engines, classical search engine result pages) and per identity cluster (person, work, genre, world-mechanic). In Phase 1 this line is descriptive, not inferential.

2.2  Line 2 — Measurement-Instrument Validation

Question. How reliable is each measurement surface, with what drift characteristic does it respond to model updates and platform policy, and which surfaces are redundant with one another? Operationalised via test-retest correlation, intra-set consistency (Cronbach’s α), CUSUM drift detection, and inter-surface agreement.
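As a rough illustration of the intra-set consistency statistic named above, the following sketch computes Cronbach's α for a query set, treating each query as an item and each survey run as an observation. The function name, the data layout, and the toy scores are assumptions made for illustration; the programme's actual scoring is not specified here.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per survey run (observation) and
            one column per query (item). Hypothetical layout.
    """
    k = len(scores[0])  # number of items (queries)
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 observations")
    # Sample variance of each item across observations
    item_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of the per-observation total score
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total score has zero variance; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy data: 4 survey runs x 3 queries, 1 = correct citation, 0 = miss
runs = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(runs)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Values close to 1 indicate that the queries of a set move together across runs; low values suggest the set mixes queries that respond to different things.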

2.3  Line 3 — Codebook Iteration

Question. Which operationalisation of "correct citation" is robust to edge cases, surface differences, and language-model style drift? Versioned annotation schema v0.x → v1.0, with inter-rater agreement via external annotators from Q4 / 2026.
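Inter-rater agreement on a schema version can be summarised with Cohen's κ, the statistic that the metrics table in Section 5.3 lists for this line with a release threshold of κ ≥ 0.7. The sketch below is a minimal two-rater implementation; the label names and the toy annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the two raters' marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy annotations of ten model answers (hypothetical labels)
a = ["correct", "correct", "wrong", "wrong", "correct",
     "wrong", "correct", "correct", "wrong", "correct"]
b = ["correct", "correct", "wrong", "correct", "correct",
     "wrong", "correct", "wrong", "wrong", "correct"]
kappa = cohens_kappa(a, b)
print(f"kappa = {kappa:.3f}, release threshold met: {kappa >= 0.7}")
```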

2.4  Line 4 — Open Materials

Question. Are all findings externally auditable and reproducible for other author identities? What obstacles arise in method transfer, and which components generalise as tool packages?

2.5  Line 5 — Active Intervention Registration (new in v2.0)

Question. Which deliberate actions are executed on which identity surfaces, with what temporal marker, and which measurement surfaces are tested in which interrupted-time-series window? These interventions were implicit before v2.0 and are now explicitly pre-registered: each receives its own Q-number (Q0–Q6+) and a YAML specification (overview in Section 4.1) with stopping rule, surface mapping, and reporting plan. Confounds between parallel interventions are explicitly named; holdout periods are planned wherever methodologically tenable.

3  Data Sources and Operationalisation

Table 1 summarises sources and survey cadences. All endpoints are addressed with the programme-wide user agent marin-t-kael:research-tooling:v0.1 (by /u/marintkael). Response headers are archived. Rate-limit signals (429, 503) trigger exponential backoff.

| Measurement surface | Endpoint type | Cadence | Validation method |
| --- | --- | --- | --- |
| Wikidata Entity (author) | SPARQL · REST | daily | replication after 24 h |
| Wikidata Entity (book) | SPARQL · REST | daily | replication after 24 h |
| Google Knowledge Graph Search | API | daily (since 2026-05-14) | replication after 24 h |
| Bing Webmaster · AI indexing | API | daily | CUSUM on hit-rate |
| Google Search Console | API | daily | replication after 24 h |
| Google AI Overviews | browser snapshot | weekly | inter-snapshot agreement |
| Goodreads / Hardcover | GraphQL + HTML snapshot | daily (since 2026-05-14) | replication after 24 h |
| Reddit · public JSON | HTTP GET | after each post | snapshot hash comparison |
| Language-model probe (Gemini, Claude) | browser snapshot | weekly | replication after 24 h · model-version log |
| Cross-LLM Trust Graph (v2.0 · 11-LLM stack from v2.7) | source-attribution parser on 11-LLM answers (3 Anthropic tiers Haiku 4.5 / Sonnet 4.6 / Opus 4.7 via Message Batches API · OpenAI gpt-4o-mini + Search-Preview · Gemini 2.5 Flash · 5 Cloudflare Workers AI: Mistral 7B + Llama 3/3.1/3.2 + Phi-2) | daily (sync) · async-batch for Anthropic (max 24 h, typical <1 h) | 12 source patterns × trust weight (+2 / +1 / 0 / −1) |
| Common-Crawl Snapshot Inclusion Probe (v2.0) | CC index API + domain crawl | monthly (per snapshot) | URL inclusion rate · page coverage |
| Machine-readable identity surfaces (v2.0) | HTTP GET server logs on /llms.txt /ai.txt /about.txt | daily | crawler user-agent histogram per bot ID |
| Outbound drafts of the pipeline | file stream | before each dispatch | linter check against style-sheet |

Table 1 Eleven measurement surfaces plus outbound pipeline, with endpoint type, survey cadence, and validation method. Three surfaces (Cross-LLM Trust Graph, Common-Crawl Probe, Identity Surfaces) are newly added in v2.0.
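The rate-limit handling described at the start of this section (exponential backoff on 429 and 503) could look roughly like the sketch below. The function name, the retry limits, and the injected `fetch` callable are assumptions for illustration; the programme's actual tooling is not reproduced here. Handling of a `Retry-After` header and random jitter are deliberately left out to keep the sketch short.

```python
import time

USER_AGENT = "marin-t-kael:research-tooling:v0.1 (by /u/marintkael)"
RETRY_STATUSES = {429, 503}  # rate-limit signals named in the text


def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch(url, headers) and retry on rate-limit signals.

    fetch is any callable returning (status, response_headers, body).
    It is injected so the retry logic can be tested without network access.
    """
    headers = {"User-Agent": USER_AGENT}
    for attempt in range(max_retries + 1):
        status, resp_headers, body = fetch(url, headers)
        if status not in RETRY_STATUSES:
            return status, resp_headers, body
        if attempt < max_retries:
            # Delay doubles per attempt: base, 2*base, 4*base, ... up to max_delay
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError(
        f"gave up on {url} after {max_retries} retries (last status {status})"
    )
```

Because the response headers are returned alongside the status, the caller can archive them as the section requires.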

The canonical truth for the assessment of all lines is the versioned style-sheet of the programme. It contains the fixed person, place, and world-mechanic strings as well as the excluded anti-patterns. Each survey is logged with the style-sheet version at the time of the survey.

4  Pre-Registration Protocol

Before the start of each survey, a pre-registration is published on this site. It contains the following fields in machine-readable form (YAML):

id: prereg-q0-wikidata-to-google-kg-latency
field: 1
hypothesis: |
  Structured statements from Wikidata reach the Google Knowledge Graph
  within ≤14 days of entity publication.
operationalisation:
  source: wikidata.org/entity/Q140004504
  comparator: kgsearch.googleapis.com (Knowledge Graph Search API)
  measurement: ratio (matched canonical statements / total queried)
sampling:
  start: 2026-05-11
  end: 2026-09-22
  cadence: daily 04:00 UTC
stopping_rule: 134 surveys or saturation (no new hit in 14 days)
analysis: descriptive statistics · visualisation of latency distribution
version: v1.0 · 2026-05-10
Listing 1 Example of a pre-registration file. Full pre-registrations are published as a YAML appendix to the respective quarterly report.

Subsequent deviations from the pre-registered plan are permissible but must be openly disclosed in the report. In disputed cases the pre-registered plan counts as the hypothesis that was to be tested.
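The fields of the example pre-registration file can be checked mechanically before publication. The sketch below works on an already parsed record (a plain dict with dates kept as ISO strings); the list of required fields simply mirrors the top-level keys of the example and is an assumption, not a published schema.

```python
from datetime import date

# Top-level keys of the example file; assumed to be mandatory for illustration
REQUIRED_FIELDS = ("id", "field", "hypothesis", "operationalisation",
                   "sampling", "stopping_rule", "analysis", "version")


def check_prereg(record):
    """Return a list of problems found in a pre-registration record."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    sampling = record.get("sampling") or {}
    try:
        start = date.fromisoformat(sampling["start"])
        end = date.fromisoformat(sampling["end"])
    except (KeyError, TypeError, ValueError):
        problems.append("sampling.start and sampling.end must be ISO dates")
    else:
        if end <= start:
            problems.append("sampling.end must be after sampling.start")
    return problems


def window_days(record):
    """Length of the sampling window in days (end minus start)."""
    s = record["sampling"]
    return (date.fromisoformat(s["end"]) - date.fromisoformat(s["start"])).days


example = {
    "id": "prereg-q0-wikidata-to-google-kg-latency",
    "field": 1,
    "hypothesis": "Structured statements from Wikidata reach the Google "
                  "Knowledge Graph within <=14 days of entity publication.",
    "operationalisation": {"source": "wikidata.org/entity/Q140004504"},
    "sampling": {"start": "2026-05-11", "end": "2026-09-22",
                 "cadence": "daily 04:00 UTC"},
    "stopping_rule": "134 surveys or saturation (no new hit in 14 days)",
    "analysis": "descriptive statistics",
    "version": "v1.0 · 2026-05-10",
}
print(check_prereg(example), window_days(example))
```

For the example, the window spans 134 days, which agrees with the survey count named in its stopping rule.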

4.1  Active Interventions Q0–Q6 (new in v2.0 · Q6 in v2.3)

v2.0 makes explicit what ran implicitly in v1.x: Phase 1 contains deliberate interventions on identity surfaces. Each intervention is its own pre-registration with Q-number, hypothesis, stopping criterion, and effect-detection window. The full YAML files will be published as an appendix to the next quarterly report (Q3 / 2026); Table 2 gives a compact overview.

| Q-ID | Hypothesis (short) | Identity surface / action | Effect measurement surface | Status | Start |
| --- | --- | --- | --- | --- | --- |
| Q0 | Wikidata statements reach Google KG within ≤14 days | Wikidata curation (Q140004504 + Q140004740) | Google KG Search API · KG score | active (latency ≈ 40 h observed T+0 → T+2) | 2026-05-11 |
| Q1 | P136 genre statements (Q3294789 High Fantasy etc.) increase co-occurrence in the LLM cluster | Wikidata Co-Occurrence Engineering (6–8 P-statements) | Cross-LLM Trust Graph · CompCluster score | registered (execution Q3-2026) | 2026-05-14 (planned) |
| Q2 | Inclusion in CC-MAIN-2026-21 (May snapshot) increases LLM citation score in the next model cycle | Common-Crawl optimisation (/llms.txt /ai.txt /about.txt + backlinks) | Cross-LLM Trust Graph from Q4 / 2026 (LLM re-training lag) | active (files deployed T+2) | 2026-05-13 |
| Q3 | Source-attribution profile per LLM drifts < 1 trust point over 90 days (stability anchor for ITS) | Cross-LLM Trust Graph tracking (12 patterns × 11 LLMs from v2.7) | ai_citation_sources · CUSUM on trust-score mean | active (live from T+3 after cron 04:00 UTC) | 2026-05-14 |
| Q4 | Reddit comment karma > 200 in 6 subreddits in 90 days increases mention-cluster visibility | Reddit karma building (1× substantive comment / sub / week) | Reddit mention snapshots · Cross-LLM Trust Graph cluster "Research" | active (running in r/Fantasy since T+0; 7 more subs from T+3) | 2026-05-11 |
| Q5 | Zenodo DOI cadence (1 MN / quarter) triggers Wikipedia notability threshold crossing | Zenodo DOI salvo (MN-01 v1.x + v2.x, MN-02 Q3, MN-03 Q4) | Wikipedia article-existence probe (CC-MAIN coverage) | registered (MN-01 v2.0 live T+2; MN-02 ca. 2026-07-15) | 2026-05-13 |
| Q6 (v2.3) | Consistent reader activity on Hardcover (reviews + mark-as-read + want-to-read) produces a reader-authenticity signal that strengthens cross-linking to Goodreads and the LLM trust cluster “reading community” | Reader-account activity volume on Hardcover (reviews, mark-as-read, want-to-read) | Hardcover snapshot pipeline · books_read · reviews_written · cross-LLM trust graph cluster “reading community” | active (since T+3, low-volume sustained) | 2026-05-14 |

Table 2 Seven active pre-registrations (Q0–Q5 v2.0, Q6 v2.3). All formally independent but run in temporal parallel; inter-Q confounds are explicitly named in each quarterly report. Holdout periods are planned wherever methodologically tenable (Q4 has, for example, a 30-day pause between sub onboardings for karma-build isolation; Q6 runs low-volume sustained without bursts).

5  Measurement Instrumentation

5.1  Style-Sheet as Canonical Truth

The programme-internal style-sheet defines, for every operationalised statement, a canonical string, a list of explicitly impermissible anti-patterns, and a severity grade. It is maintained under versioning: changes are traceable, and every entry carries an effective date and a source reference.

5.2  Linter

The programme-wide linter (source file style_lint.py, MIT-licensed) checks outbound drafts against the style-sheet and against platform-specific rules (e.g. Reddit's 9:1 self-promotion recommendation). It returns a classification into four severity grades. S1 and S4 findings are blocking; S2 produces a warning; S3 is logged.
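To make the severity handling concrete, here is a minimal sketch of a style-sheet check. It is not the programme's `style_lint.py`: the `Rule` class, the `lint` and `is_blocked` functions, and both example rules are hypothetical. Only the mapping of grades to actions (S1 and S4 block, S2 warns, S3 is logged) is taken from the text above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: str    # regular expression describing an anti-pattern
    severity: str   # one of "S1" .. "S4"
    message: str


# Action per severity grade, as stated in the section text
ACTIONS = {"S1": "block", "S2": "warn", "S3": "log", "S4": "block"}


def lint(draft, rules):
    """Return (severity, action, message) for every rule the draft violates."""
    findings = []
    for rule in rules:
        if re.search(rule.pattern, draft, flags=re.IGNORECASE):
            findings.append((rule.severity, ACTIONS[rule.severity], rule.message))
    return findings


def is_blocked(findings):
    """A draft must not be dispatched if any finding is blocking."""
    return any(action == "block" for _, action, _ in findings)


# Hypothetical style-sheet entries, for illustration only
RULES = [
    Rule(r"\bMarin Kael\b", "S1", "author name must include the middle initial"),
    Rule(r"\bbuy now\b", "S2", "promotional phrasing"),
]

findings = lint("Marin Kael wrote this. Buy now!", RULES)
print(findings, is_blocked(findings))
```

Keeping the grade-to-action mapping in one table means a change of policy touches a single place rather than every rule.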

5.3  Assessment Metrics and Statistical Methods per Line

Table 3 summarises the primary and secondary metrics and the statistical procedures used per Phase-1 line. The full a-priori power analyses and stop thresholds are part of the respective pre-registration of the individual survey. The methodological inventory for Phase 3 (long-term controlled experiments from Q3 / 2027), comprising Cohen’s d, the Bayes-factor stop procedure, a-priori power analysis, decay fit, and ITS/BSTS, is documented in §§ 5.4 to 5.7 as a preview.

| Line | Primary metric | Secondary metrics | Statistical procedure |
| --- | --- | --- | --- |
| 1 | Hit-rate H = correct citations / queries (descriptive) | Coverage quota per identity cluster (person, work, genre, world-mechanic); hallucination rate | Descriptive statistics; Wald 95 % CI on H; coverage heatmap per surface × cluster (no inference on action-effect in Phase 1) |
| 2 | Test-retest correlation r (24-h replication) | Cronbach’s α intra-query-set; ICC (Intraclass Correlation) inter-snapshot; model-version logs | Pearson r with bootstrap 95 % CI; thresholds r ≥ 0.9 (API), 0.7 (LLM probe); CUSUM charts (h = 5) for drift detection; model-update markers |
| 3 | Inter-rater agreement Cohen’s κ (from Q4 / 2026) | Edge-case coverage of the schema; annotation difference per schema version | Cohen’s κ with bootstrap CI; threshold κ ≥ 0.7 as codebook version release; Krippendorff’s α as sensitivity check |
| 4 | Replication success rate by external auditors | Data-pin completeness, environment.yml executability | Qualitative 3rd-party audit report per replication archive; issue tracker on GitHub for reproduction failures |

Table 3 Phase-1 assessment metrics and statistical procedures per line. Before each survey, the table is specified in more detail for the concrete survey plan in the pre-registration.
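Two of the procedures listed for Lines 1 and 2 are simple enough to sketch with the standard library: the Wald 95 % interval on the hit-rate H and a percentile-bootstrap interval on the test-retest correlation r. The function names, the number of bootstrap resamples, the fixed seed, and the toy data are assumptions for illustration only.

```python
import math
import random
from statistics import NormalDist


def wald_ci(hits, n, level=0.95):
    """Wald confidence interval for a hit-rate H = hits / n."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def pearson_r(x, y):
    """Pearson correlation; returns None if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_r_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            rs.append(r)
    if not rs:
        raise ValueError("all resamples were degenerate")
    rs.sort()
    lo = rs[int((1 - level) / 2 * len(rs))]
    hi = rs[int((1 + level) / 2 * len(rs)) - 1]
    return lo, hi


# Toy data: hit-rates of 8 queries on day t and in the 24-h replication
day_t = [0.2, 0.4, 0.5, 0.7, 0.9, 0.3, 0.6, 0.8]
day_t1 = [0.25, 0.35, 0.55, 0.65, 0.95, 0.3, 0.6, 0.75]
r = pearson_r(day_t, day_t1)
print(wald_ci(80, 100), r, bootstrap_r_ci(day_t, day_t1))
```

The Wald interval is known to behave poorly when H is close to 0 or 1 or when n is small, which matters for surfaces with very low coverage.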

5.4  Sequential Bayes-Factor Test (Phase-3 inventory)

For the long-term controlled experiments of Phase 3 (from Q3 / 2027), the Bayes factor BF₁₀ is updated sequentially as observations are collected. Stop thresholds are pre-specified: BF₁₀ > 10 confirms the hypothesis, BF₁₀ < 1⁄10 rejects it. Between the two thresholds, the survey continues until one of them is reached or the sample maximum is exhausted. The procedure is documented here in Phase 1 as methodological inventory; it is activated only on the validated apparatus.

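[Figure: prior and posterior density over the effect size (x-axis: Cohen’s d, −1.0 to 2.0; y-axis: posterior density; posterior mean d̂ = 0.62), with threshold lines at BF₁₀ = 1/10 and BF₁₀ = 10 separating three regions: H₀ confirmed (effect rejected), continue sampling, and H₁ confirmed (effect demonstrated).]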
Figure 1 Posterior distribution of the effect size for an example hypothesis. The survey continues until BF₁₀ crosses one of the two thresholds — then stop. At the state shown (posterior mean d̂ = 0.62), the survey would not yet end; the stop threshold at BF₁₀ = 10 (corresponding to d ≈ 0.85 under this distribution) is not reached.
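The stopping logic can be sketched independently of the effect-size model. The section does not specify how BF₁₀ is computed, so the example below swaps in a deliberately simple stand-in: a binomial hit-rate with a point null p = p0 against a uniform prior on p. Only the thresholds 10 and 1/10 and the sample maximum come from the text; the function names and the binomial model are assumptions.

```python
import math


def log_bf10_binomial(k, n, p0):
    """log BF10 for H1: p ~ Uniform(0, 1) versus H0: p = p0, with k hits in n trials."""
    # Under a uniform prior the marginal likelihood of any k is 1 / (n + 1)
    log_m1 = -math.log(n + 1)
    # Binomial likelihood under the point null
    log_m0 = (math.log(math.comb(n, k))
              + k * math.log(p0) + (n - k) * math.log(1 - p0))
    return log_m1 - log_m0


def sequential_decision(outcomes, p0, upper=10.0, lower=0.1, n_max=None):
    """Update BF10 after each observation and stop at the first threshold crossing.

    Returns (decision, n_used, bf10) with decision in {"H1", "H0", "undecided"}.
    """
    k, n, bf = 0, 0, 1.0
    for n, hit in enumerate(outcomes, start=1):
        k += int(hit)
        bf = math.exp(log_bf10_binomial(k, n, p0))
        if bf > upper:
            return "H1", n, bf
        if bf < lower:
            return "H0", n, bf
        if n_max is not None and n >= n_max:
            break
    return "undecided", n, bf


# Seven consecutive hits against p0 = 0.5 give BF10 = 2**7 / 8 = 16
print(sequential_decision([1] * 20, p0=0.5))
```

Working on the log scale keeps the computation stable for long series, where the raw binomial likelihood underflows.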

5.5  CUSUM Charts for Drift Detection (Phase-1 core)

For Line 2 (Measurement-Instrument Validation), a cumulative sum chart (CUSUM) detects systematic shifts of the hit-rate of a measurement surface against baseline. As soon as the chart exceeds a pre-specified alarm threshold, the affected surface is marked, the drift cause (typically: model update or platform change) is identified, and recorded in the model-version log. CUSUM is the central Phase-1 tool because it detects instrument drift early, before it is misinterpreted as an "effect observation".

[Figure: CUSUM chart; x-axis: survey day (0 to 60), y-axis: CUSUM Si (−3 to 7); reference line Si = 0 and alarm threshold h = 5; annotation: drift detected on day 36, survey halted.]
Figure 2 CUSUM trajectory of a hit-rate survey: until day ~32 the measurement procedure remains stable around Si = 0. From day 32, cumulative deviation builds up — on day 36 Si exceeds the pre-specified alarm threshold h = 5, the survey is halted. Example: a model update at one of the observed language models has structurally altered the hit-rate.
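A drift check of this kind can be sketched as a two-sided tabular CUSUM on standardised hit-rates. The alarm threshold h = 5 is the value given in this note; the slack parameter k = 0.5, the baseline mean and standard deviation, and the toy series are assumptions, and the programme's exact CUSUM variant is not specified here.

```python
def cusum_alarm(values, baseline_mean, baseline_sd, k=0.5, h=5.0):
    """Two-sided tabular CUSUM on standardised values.

    Returns (index of first alarm or None, upper path, lower path).
    k is the slack per step and h the alarm threshold, both in SD units.
    """
    if baseline_sd <= 0:
        raise ValueError("baseline_sd must be positive")
    s_hi = s_lo = 0.0
    hi_path, lo_path = [], []
    alarm = None
    for i, v in enumerate(values):
        z = (v - baseline_mean) / baseline_sd
        s_hi = max(0.0, s_hi + z - k)   # accumulates upward shifts
        s_lo = max(0.0, s_lo - z - k)   # accumulates downward shifts
        hi_path.append(s_hi)
        lo_path.append(s_lo)
        if alarm is None and (s_hi > h or s_lo > h):
            alarm = i
    return alarm, hi_path, lo_path


# Toy series: stable hit-rate of 0.5 for 30 surveys, then a drop to 0.4
series = [0.5] * 30 + [0.4] * 10
alarm, up, down = cusum_alarm(series, baseline_mean=0.5, baseline_sd=0.05)
print("first alarm at index:", alarm)
```

With these toy numbers each post-shift survey adds 1.5 to the lower sum, so the threshold is crossed on the fourth shifted observation.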

5.6  A-priori Power Analysis (Phase-3 inventory)

For the long-term controlled experiments of Phase 3, the required sample size for a power of 1 − β = 0.80 at α = 0.05 is calculated before each survey, depending on the expected effect size. Large effects (d = 0.8) require roughly 125 observations to be detected at 80 % power; medium effects (d = 0.5) need ~310; small effects (d = 0.2) can barely be detected reliably even with 600 observations. The procedure is activated only on the validated apparatus from Q3 / 2027.

[Figure: power curves for d = 0.8 (large), d = 0.5 (medium), and d = 0.2 (small); x-axis: sample size N (surface × query × day, 0 to 600), y-axis: statistical power 1 − β (0 to 1.0); reference line at power = 0.80 with marks at ~125 and ~310.]
Figure 3 Power curves for three effect classes at α = 0.05. To detect a large effect (d = 0.8) with 80 % power, roughly 125 observations are sufficient; a medium effect needs ~310; a small effect can barely be detected reliably even with 600 observations. The sizing of survey windows (Line 2) is set against this.
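For orientation, the textbook normal-approximation formula for an a-priori sample size can be written in a few lines. It assumes a two-sided comparison of two groups of independent observations; that design, and the function name, are assumptions made here. The per-group sizes it returns are therefore not directly comparable with the observation counts quoted above, which are counted per surface × query × day.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-group comparison.

    Uses n = 2 * (z_{1 - alpha/2} + z_{power})**2 / d**2 (normal approximation).
    """
    if d <= 0:
        raise ValueError("effect size d must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    print(f"{label:6s} d = {d}: n per group = {n_per_group(d)}")
```

Autocorrelated daily measurements carry less information than independent ones, so a design-specific calculation would be expected to require more observations than this idealised formula suggests.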

5.7  Latency and Half-Life Comparison (Phase-2 inventory)

Expected effect profiles of action classes differ widely. Fast platforms (IndexNow, Reddit) show effects within hours to days; identity-based levers (ORCID updates, Wikidata edits) need indexing cycles of days to weeks. This a-priori expectation is methodological inventory for the Phase-3 long-term controlled experiments from Q3 / 2027 and will then be mirrored against real measurements.

[Figure: box plot of expected days to effect onset (x-axis: 0 to 30 days) for six action classes: IndexNow bulk push, Reddit post, newsletter dispatch, Wikidata edit, ORCID profile update, manuscript indexing.]
Figure 4 · Phase-2 preview Expected latencies to effect onset for six action classes. Box: interquartile range (median marked); whiskers: expected full range. Fast platforms (IndexNow, Reddit) act within days; identity-based levers (ORCID, manuscript) need weeks. This a-priori expectation will, from Q3 / 2027 — on the then-validated apparatus — be mirrored against real measurements and revised if necessary.

6  Reproducibility

Every publication of the programme is accompanied by a replication archive. The archive contains:

  1. the measurement code as executable Python scripts (MIT licence) — with frozen version pins via environment.yml;
  2. the raw measurement values as JSONL, including timestamp, endpoint URL, HTTP status, and response headers;
  3. the style-sheet in the state of the survey;
  4. a short README.md with reproduction instructions.
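A raw-data record of the kind listed in item 2 could be written and read back as follows. The field names, the helper functions, and the inclusion of the style-sheet version in each record are assumptions for illustration; the archive's actual schema is not reproduced here.

```python
import json
from datetime import datetime, timezone


def make_record(endpoint_url, http_status, response_headers, value,
                style_sheet_version):
    """Build one raw measurement record (hypothetical field names)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "endpoint_url": endpoint_url,
        "http_status": http_status,
        "response_headers": dict(response_headers),
        "style_sheet_version": style_sheet_version,
        "value": value,
    }


def append_jsonl(path, record):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")


def read_jsonl(path):
    """Read all records back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per measurement keeps the file valid after an interrupted run, since every completed line is a self-contained JSON document.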

Pre-release versions of the code are made available to external auditors on request before the first regular publication; the address is research@marin-t-kael.de.

7  Limitations

The programme's construction has methodological limitations, named here in advance so that they are not mistaken for a finding.

  1. Single-case study. The observation encompasses one work, one author, one language (DE primary, EN secondary). Findings are not generalisable; they connect to other carefully documented single cases.
  2. Conflict of interest. Programme lead and subject of inquiry are identical. The practice of pre-registration and the public failure log mitigate the risk of publication bias; they do not eliminate it.
  3. Endpoint volatility. Search and answer systems change during the active pre-launch period (algorithms, model versions, indexing logic). Versioned endpoints are addressed where possible; where not, the version state at the time of the survey is logged.
  4. Observation effect. The public programme changes the observed: platform reviewers, reader communities, and knowledge-graph editors may be influenced by the publication. The effect is qualitatively discussed in the quarterly report.
  5. Inter-Q confound (new in v2.0). Seven active pre-registrations (Q0–Q6) run temporally parallel on partly overlapping measurement surfaces. Isolated effect attribution of a single Q is therefore mostly not possible; instead, effects are reported as combined movement of the measurement surface with a plausibility discussion of which Q probably contributed. Holdout periods are planned wherever methodologically tenable (see Q4 as example). The Phase-3 apparatus (Q3 / 2027+) will be able to reduce confounds through single-subject reversal designs.

8  Ethical Commitments

  1. No collection of personal data of third parties beyond what is publicly available.
  2. No aggregation or transfer of reader material to third parties; no use as training data for models.
  3. Full compliance with platform policy; violations are openly logged and corrected.
  4. Citation of individual contributions only with explicit consent of the writer.
  5. On request of a platform, data retrieval is immediately discontinued and the instruction is published as a finding.

9  Publication Cadence

Quarterly reports appear in mid-October, January, April, and July, with a tolerance of ±two weeks. Methodology notes appear ad hoc on substantive changes to the measurement frame. Pre-registrations appear before the start of each new survey. A field report supplements the programme with specific topics outside the quarterly logic.

10  Citation

Kael, M. T. (2026). Baseline Measurement: Author Identity in the Citation Behaviour of Language Models (Pre-Launch). Methodology Note 01, Marin T. Kael — AI Citation Behaviour Lab. Marin T. Kael, Independent. Version 3.0 (DE + EN bilingual, 18 May 2026 · Methodology Note now ships both language PDFs in a single Zenodo record for full international reach in the GEO/AEO community + arXiv audience. PDFs consistent with v2.9 content (bilingual bundling, aggregate methodology tightened, Gemini Direct Batch integrated). Consistency bump v2.9 → v3.0 reflects that this is the first fully bilingual-conformant Methodology Note record). Version DOI: 10.5281/zenodo.20308495 · Concept DOI (version-stable): 10.5281/zenodo.20125933 · Predecessor v2.9 (superseded): 10.5281/zenodo.20262237.

BibTeX
@techreport{kael2026citation,
  author      = {Kael, Marin T.},
  title       = {Baseline Measurement: Author Identity in the Citation Behaviour of Language Models (Pre-Launch)},
  institution = {Marin T. Kael, Independent},
  type        = {Methodology Note},
  number      = {01},
  series      = {Marin T. Kael — AI Citation Behaviour Lab},
  year        = {2026},
  month       = {5},
  version     = {v3.0},
  language    = {german + english (bilingual)},
  doi         = {10.5281/zenodo.20308495},
  url         = {https://doi.org/10.5281/zenodo.20125933},
  note        = {v2.8 methodological extension (2026-05-17 evening): aggregate methodology tightened — API call failures (quota exhaustion, HTTP 5xx, timeouts) that previously returned null and were silently scored as 'not_found' / score=0 are now propagated as status='error' and excluded from the aggregate denominator. Per-LLM rows now carry n_attempted / n_legit / n_error, pure-error LLMs flagged 'unavailable'. Retroactive re-aggregation of 41 historical snapshots, mean delta +10.64 pp between pre-fix and post-fix aggregates. Both values persisted in parallel via D1 Migration 0013 (score_percent / score_percent_v271) for full audit trail. v2.8 also adds Gemini 2.5 Flash via the Gemini Batch API direct (analog to API Batches, API-key-based, no Service-Account-JWT-signing). Predecessor v2.7 (DOI 10.5281/zenodo.20258380, superseded) · Concept-DOI: 10.5281/zenodo.20125933 (version-stable).},
}