Marin T. Kael

Activity Report · Q3 / 2026 · Preview

Active Pre-Launch Phase — First 90 Days Q0–Q5 + Parallel Measurement Apparatus.

Phase-1 activity report: six pre-registrations Q0–Q5 with parallel instrument validation across eight measurement surfaces (preview layout).

Abstract

Over the pre-launch window from 11 May to 22 September 2026, eight measurement surfaces were sampled daily across 12 pre-registered query sets, totalling 13,440 daily samples plus 1,680 24-h replication samples. Phase-1 findings: three API-based measurement surfaces (Wikidata, Reddit, Google Search Console) reached test-retest reliability r ≥ 0.90; Google Knowledge Graph and Goodreads / Hardcover fell in the acceptable band (r = 0.88 and 0.85); Bing AI lay at r = 0.76 with a CUSUM drift alarm on day 36 (Bing AI model update on 14 July); the language-model browser probes (Gemini, Claude) stayed at r = 0.64 and 0.58 respectively, below the Phase-2 threshold. Codebook v0.1 is complete and its edge-case discussion remains open; the external-annotator pipeline is scheduled for Q4.

Keywords: measurement-instrument validation · test-retest reliability · CUSUM drift detection · coverage mapping · codebook iteration · Phase 1 · AI search

1

Survey Overview

Active pre-launch window: 11 May 2026 to 22 September 2026 (T+0 = 11 May). Sample: 8 measurement surfaces × 12 query sets × 140 daily survey days, plus 24-h replication probes per measurement surface on 14 randomly chosen survey days, totalling 13,440 daily samples plus 1,680 replication samples. The methodology follows Methodology Note 01.
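As a quick check of the design arithmetic, the daily sample total can be recomputed from the three stated factors. This is only a sketch: the constant names are illustrative, and the replication total is carried over as reported because its per-day breakdown is not given in this overview.

```python
# Design figures as stated in the survey overview.
SURFACES = 8
QUERY_SETS = 12
SURVEY_DAYS = 140

# One sample per surface, query set and survey day.
daily_samples = SURFACES * QUERY_SETS * SURVEY_DAYS

# Reported total; the overview does not state how it splits per surface and day.
REPLICATION_SAMPLES = 1_680

total_samples = daily_samples + REPLICATION_SAMPLES
print(daily_samples, total_samples)
```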

The quarter interleaved Q0–Q5 intervention tracking with parallel instrument validation (Phase 1 · Active Pre-Launch). Six pre-registrations ran in parallel; their effects are to be detected via interrupted-time-series windows on the affected measurement surfaces, and inter-Q confounds are explicitly disclosed. Actions were recorded as deterministic marker events so that drift indications can be distinguished from action-induced shifts; no claim about their actual effect is made before the apparatus has been validated in Phase 1.
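To illustrate how a marker event can anchor an interrupted-time-series window, the following minimal sketch fits a straight line to the pre-marker observations, projects it forward, and reports the mean deviation of the post-marker observations from that projection. This is a deliberately simplified form of interrupted-time-series analysis, not the report's pre-registered model; the function names `fit_line` and `level_shift` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def level_shift(series, marker_index):
    """Mean deviation of post-marker values from the projected pre-marker trend.

    `series` holds one value per survey day (e.g. a daily hit rate);
    `marker_index` is the position of the marker event (assumed known).
    """
    if marker_index < 2 or marker_index >= len(series):
        raise ValueError("need at least two pre-marker and one post-marker point")
    slope, intercept = fit_line(list(range(marker_index)), series[:marker_index])
    residuals = [
        series[t] - (intercept + slope * t)
        for t in range(marker_index, len(series))
    ]
    return mean(residuals)


# Synthetic series: linear trend with a step of 0.5 at day 10 (made-up data).
synthetic = [0.1 * t for t in range(10)] + [0.1 * t + 0.5 for t in range(10, 20)]
shift = level_shift(synthetic, 10)
```

On noisy real series a projection like this says nothing about uncertainty, and it cannot separate overlapping interventions, which is why the inter-Q confounds need to be disclosed separately.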

2

Line 1 — Citation Inventory

Coverage mapping of the author identity by identity cluster on the reference date 22 September 2026: person and work clusters are well established in structured sources (Wikidata, Goodreads, Hardcover); genre and world-mechanic clusters remain weakly represented across all measurement surfaces.

This section comprises a coverage difference matrix with the hit-rate change per measurement surface × cluster over the 90-day window, time-series plots of selected surfaces, and CUSUM snapshots for drift detection. In the preview layout, these are represented by the coverage heatmap and the drift profile on /en/research.

3

Line 2 — Measurement-Instrument Validation

Test-retest reliability, intra-set consistency, and CUSUM drift statistics per measurement surface after 90 days. Table 1 summarises the primary findings.

| Measurement surface | Test-retest r | 95% CI | α intra-set | CUSUM alarm | Validation status |
| --- | --- | --- | --- | --- | --- |
| Wikidata (SPARQL) | 0.96 | [0.94; 0.98] | 0.93 | no | validated |
| Reddit (public JSON) | 0.94 | [0.91; 0.97] | 0.89 | no | validated |
| Google Search Console | 0.92 | [0.89; 0.95] | 0.87 | no | validated |
| Google Knowledge Graph | 0.88 | [0.83; 0.93] | 0.82 | no | acceptable |
| Goodreads / Hardcover | 0.85 | [0.80; 0.90] | 0.78 | no | acceptable |
| Bing Webmaster AI | 0.76 | [0.68; 0.84] | 0.73 | Day 36 (model update) | acceptable, drift-burdened |
| Gemini (browser probe) | 0.64 | [0.53; 0.75] | 0.71 | no | below Phase-2 threshold |
| Claude (browser probe) | 0.58 | [0.45; 0.71] | 0.67 | no | below Phase-2 threshold |
Table 1. Test-retest reliability, intra-query-set consistency, and CUSUM drift statistics per measurement surface over the 90-day window. Validation thresholds: r ≥ 0.9 (validated), 0.7 ≤ r < 0.9 (acceptable), r < 0.7 (below Phase-2 threshold).
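The following sketch shows one way the quantities in Table 1 could be computed: a Pearson correlation between original and 24-h replication scores, a confidence interval via the Fisher z-transformation, and the status bands from the caption. The report does not state how its intervals were obtained; the intervals in Table 1 are symmetric around r, which a Fisher interval generally is not, so this is an alternative under stated assumptions, not a reconstruction. All function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between paired original and replication scores."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need at least four paired observations")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transformation (requires |r| < 1)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def validation_status(r):
    """Status bands as given in the Table 1 caption."""
    if r >= 0.9:
        return "validated"
    if r >= 0.7:
        return "acceptable"
    return "below Phase-2 threshold"


# Status bands applied to three of the reported coefficients.
statuses = [validation_status(r) for r in (0.96, 0.88, 0.64)]
```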

Bing AI showed a CUSUM alarm on day 36 (14 July 2026) at h = 5; the drift coincided with a Microsoft-announced Bing AI model update and was annotated accordingly. The language-model browser probes (Gemini, Claude) remained continuously below the Phase-2 reliability threshold. For Phase 2, either the survey methodology must be adapted (e.g. multiple snapshots per survey day with aggregation) or these surfaces must be excluded from the effect-measurement set.
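A CUSUM alarm of this kind can be sketched with a two-sided tabular CUSUM on standardised values. The decision threshold h = 5 is taken from the text; the reference value k = 0.5, the standardisation against a known baseline mean and standard deviation, and the function name are assumptions, since the report does not specify its parametrisation.

```python
def cusum_first_alarm(values, mu, sigma, k=0.5, h=5.0):
    """Return the 0-based index of the first CUSUM alarm, or None if none occurs.

    values : daily measurements for one surface
    mu, sigma : baseline mean and standard deviation (assumed known here)
    k : reference value in sigma units (assumption, not stated in the report)
    h : decision threshold in sigma units (h = 5 as in the report)
    """
    s_hi = 0.0  # accumulates evidence for an upward shift
    s_lo = 0.0  # accumulates evidence for a downward shift
    for i, value in enumerate(values):
        z = (value - mu) / sigma
        s_hi = max(0.0, s_hi + z - k)
        s_lo = max(0.0, s_lo - z - k)
        if s_hi > h or s_lo > h:
            return i
    return None


# Synthetic example: ten in-control days, then a sustained shift of two sigma.
shifted = [0.0] * 10 + [2.0] * 10
alarm_index = cusum_first_alarm(shifted, mu=0.0, sigma=1.0)
```

In this synthetic series the upper sum grows by 1.5 per day after the shift and crosses h on the fourth shifted day, so a sustained two-sigma drift is flagged with a short delay while isolated outliers are absorbed.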

4

Line 3 — Codebook Iteration

Annotation schema v0.1 was completed on 30 June 2026; the edge-case discussion of three contested patterns (paraphrased citations, partially correct work titles, pseudonym mentions without work reference) is public on GitHub Issues.

  • Schema versioning

    Codebook v0.1 (initial) → v0.2 scheduled for 30 November 2026, after evaluation of the external-annotator pilot round. Inter-rater agreement (Cohen’s κ) will be reported for the first time in the Q4 report.

    in progress

  • Edge-case sample

    48 edge cases from the 90 days were collected and publicly discussed in the style-sheet annotation appendix; of these, 31 were classified as "correct citation", 12 as "partially correct", 5 as "hallucination".

    published
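Since inter-rater agreement is to be reported as Cohen’s κ, a minimal sketch of the statistic for two annotators may be useful. The category labels follow the edge-case classes named above, abbreviated; the example annotations are made up for illustration and are not project data, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty item list")
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up annotations: "correct", "partial", "hallucination".
lead = ["correct", "correct", "partial", "hallucination"]
external = ["correct", "correct", "partial", "partial"]
kappa = cohens_kappa(lead, external)
```

With three annotators, as planned for Q4, a function like this would be applied pairwise; a multi-rater statistic such as Fleiss' κ is an alternative that the report does not mention.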

5

Discussion and Limitations

Phase 1 is half complete after 90 days, with the book launch 45 days away. The API-based measurement surfaces show the expected deterministic reliability; the language-model probes (Gemini, Claude) are too variable in their current form for pre-registered effect studies. Through the July model update, Bing AI has become a drift-demonstration surface: instructive for the methodology, but also an indication that Phase-2 post-launch effect detection needs tight CUSUM watchers.

Limitations: (i) this is a single-case study without a comparison identity, so identity-specific and structural effects cannot be separated; (ii) the annotation schema has so far been maintained solely by the programme lead, and inter-rater agreement will follow only in Q4; (iii) browser-snapshot probes (Gemini, Claude, Google AI Overviews) depend on platform-UI stability: a UI redesign can break the probe pipeline even though the measurement logic itself has not failed.

6

Pre-Registrations for Q4 / 2026

Data for three pre-registered validation hypotheses will be collected from 23 September 2026 and evaluated in the January report (Q4 / 2026):

  • H-Q4-INST-01

    Inter-rater agreement (Cohen’s κ) for codebook v0.2 between the programme lead and two external annotators reaches κ ≥ 0.7 on a sample of N = 200 probe annotations.

  • H-Q4-INST-02

    Taking n = 5 snapshots per survey day with median aggregation raises the test-retest reliability r of the Gemini and Claude probes above the Phase-2 threshold of 0.7 (power = 0.80 at expected Δr = 0.12).

  • H-Q4-INST-03

    CUSUM charts with alarm threshold h = 5 detect AI model updates (Bing AI, Gemini, Claude) in the 90-day window with sensitivity ≥ 0.80, measured against publicly communicated model-version releases.
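The two procedural hypotheses above (H-Q4-INST-02 and H-Q4-INST-03) can be sketched as follows. The first function collapses the snapshots of each survey day into their median; the second computes detection sensitivity as the share of known releases followed by an alarm. The matching window `tolerance` is an assumption, because the hypothesis does not say how late an alarm may arrive and still count; both function names are hypothetical.

```python
from statistics import median


def aggregate_daily(snapshots_by_day):
    """Collapse the snapshots of each survey day into a single median score."""
    return [median(day) for day in snapshots_by_day]


def detection_sensitivity(release_days, alarm_days, tolerance=3):
    """Share of releases followed by an alarm within `tolerance` days.

    release_days : day indices of publicly communicated model releases
    alarm_days : day indices on which a CUSUM alarm fired
    tolerance : matching window in days (assumption, not from the report)
    """
    if not release_days:
        raise ValueError("at least one known release is required")
    detected = sum(
        any(0 <= alarm - release <= tolerance for alarm in alarm_days)
        for release in release_days
    )
    return detected / len(release_days)


# Made-up data: two survey days with five binary hit/miss snapshots each.
daily_scores = aggregate_daily([[0, 1, 1, 0, 1], [1, 1, 1, 0, 0]])

# Made-up data: three releases, two of which are followed by an alarm.
sensitivity = detection_sensitivity([10, 40, 70], [12, 71])
```

The median is robust to a single deviating snapshot, which is the property the hypothesis relies on; whether it lifts r above 0.7 is exactly what the Q4 data are meant to show.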

7

Open Materials

A replication archive (Zenodo DOI) will be published with the final report. It will contain all 13,440 daily samples plus 1,680 replication samples as versioned JSON snapshots, all validation and aggregation scripts with pinned dependencies (environment.yml), the pre-registration documents in OSF format, codebook v0.1 as a snapshot, and the style-sheet version in force at the time of the survey.

Raw data are released under CC0 (where platform terms permit). Source code is released under the MIT licence at github.com/marintkael/marin-research-tools.

Citation (planned form): Kael, M. T. (2026). Active Pre-Launch Phase — First 90 Days Q0–Q5 + Parallel Measurement Apparatus. Activity Report Q3 / 2026, Marin T. Kael — AI Citation Behaviour Lab. DOI on publication on 15 October 2026.