Activity Report · Q3 / 2026 · Preview
Active Pre-Launch Phase — First 90 Days Q0–Q5 + Parallel Measurement Apparatus.
Phase-1 activity report: six pre-registrations Q0–Q5 with parallel instrument validation across eight measurement surfaces (preview layout).
Abstract
Over the pre-launch window from 11 May to 22 September 2026, eight measurement surfaces were sampled daily across 12 pre-registered query sets, totalling 13,440 daily samples plus 1,680 24-h replication samples. Phase-1 findings: three API-based measurement surfaces (Wikidata, Reddit, Google Search Console) reached test-retest reliability r ≥ 0.90; Bing AI lay at r = 0.76 with a CUSUM drift alarm on day 36 (Bing AI model update on 14 July); the language-model browser probes (Gemini, Claude) stayed at r = 0.64 and 0.58 respectively, below the Phase-2 threshold. Codebook v0.1 is complete and its edge-case discussion remains open; the external-annotator pipeline is scheduled for Q4.
Keywords: measurement-instrument validation · test-retest reliability · CUSUM drift detection · coverage mapping · codebook iteration · Phase 1 · AI search
1
Survey Overview
Active pre-launch window: 11 May 2026 to 22 September 2026 (T+0 = 11 May). Sample: 8 measurement surfaces × 12 query sets × 140 daily survey days, plus 24-h replication probes per measurement surface on 14 randomly chosen survey days, totalling 13,440 daily samples plus 1,680 replication samples. The methodology follows Methodology Note 01.
The quarter interleaved tracking of the Q0–Q5 interventions with parallel instrument validation (Phase 1 · Active Pre-Launch). Six pre-registrations ran in parallel; their effects are to be detected via interrupted-time-series windows on the affected measurement surfaces, and confounds between the Q interventions are disclosed explicitly. Actions were recorded as deterministic marker events so that drift indications can be distinguished from action-induced shifts. No claim about the actual effect of these actions is made until the apparatus has been validated in Phase 1.
2
Line 1 — Citation Inventory
Coverage mapping of the author identity by identity cluster on the reference date 22 September 2026: person and work clusters are well established in structured sources (Wikidata, Goodreads, Hardcover); genre and world-mechanic clusters remain weakly represented across all measurement surfaces.
Shown here: a coverage difference matrix with the hit-rate change per measurement surface × cluster over the 90-day window · time-series plots of selected surfaces · CUSUM snapshots for drift detection. In the preview layout, these are represented by the coverage heatmap and the drift profile on /en/research.
3
Line 2 — Measurement-Instrument Validation
Test-retest reliability, intra-set consistency, and CUSUM drift statistics per measurement surface after 90 days. Table 1 summarises the primary findings.
| Measurement surface | Test-retest r | 95% CI | Intra-set α | CUSUM alarm | Validation status |
|---|---|---|---|---|---|
| Wikidata (SPARQL) | 0.96 | [0.94; 0.98] | 0.93 | no | validated |
| Reddit (public JSON) | 0.94 | [0.91; 0.97] | 0.89 | no | validated |
| Google Search Console | 0.92 | [0.89; 0.95] | 0.87 | no | validated |
| Google Knowledge Graph | 0.88 | [0.83; 0.93] | 0.82 | no | acceptable |
| Goodreads / Hardcover | 0.85 | [0.80; 0.90] | 0.78 | no | acceptable |
| Bing Webmaster AI | 0.76 | [0.68; 0.84] | 0.73 | Day 36 (model update) | acceptable, drift-burdened |
| Gemini (browser probe) | 0.64 | [0.53; 0.75] | 0.71 | no | below Phase-2 threshold |
| Claude (browser probe) | 0.58 | [0.45; 0.71] | 0.67 | no | below Phase-2 threshold |
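The report does not state how the r values and confidence intervals in Table 1 were computed. As a minimal sketch, assuming a Pearson correlation between each daily sample and its 24-h replication and a Fisher z interval for the 95% bounds, the calculation could look like this. The function names and the idea of one numeric score per sample are hypothetical.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between paired scores (e.g. daily sample vs. 24-h replication)."""
    n = len(first)
    if n != len(second) or n < 4:
        raise ValueError("need at least 4 paired observations of equal length")
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for r via the Fisher z-transform.

    Assumption: the pairs are independent; z_crit is the two-sided 95% normal quantile.
    Requires -1 < r < 1 and n > 3.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

A Fisher interval is asymmetric around r, so it will not reproduce symmetric bounds exactly; it is shown here only as one common way to attach an interval to a test-retest correlation.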
Bing AI triggered a CUSUM alarm on day 36 (14 July 2026) at threshold h = 5; the drift coincided with a Bing AI model update announced by Microsoft and was annotated accordingly. The language-model browser probes (Gemini, Claude) remained below the Phase-2 reliability threshold throughout. For Phase 2, either the survey methodology must be adapted (e.g. multiple snapshots per survey day with aggregation) or these surfaces must be excluded from the effect-measurement set.
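To illustrate how a CUSUM alarm at h = 5 can arise, the following is a minimal sketch of a two-sided tabular CUSUM on standardised daily scores. Only the threshold h = 5 comes from the report; the baseline length, the slack value k = 0.5, the function name, and the example series are assumptions for illustration.

```python
from statistics import mean, stdev


def cusum_first_alarm(series, baseline_len, k=0.5, h=5.0):
    """Return the 0-based index of the first CUSUM alarm, or None if there is none.

    series       -- daily scores for one measurement surface (index 0 = T+0)
    baseline_len -- number of leading days used to estimate mean and spread (assumed)
    k            -- slack in standard-deviation units (assumed, not from the report)
    h            -- alarm threshold in standard-deviation units (h = 5 as in the report)
    """
    baseline = series[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance; cannot standardise")
    upper = lower = 0.0
    for day, value in enumerate(series):
        z = (value - mu) / sigma
        upper = max(0.0, upper + z - k)  # accumulates upward shifts
        lower = max(0.0, lower - z - k)  # accumulates downward shifts
        if upper > h or lower > h:
            return day
    return None


# Hypothetical series: 20 stable days followed by a sustained drop.
example_series = [0.49, 0.51] * 10 + [0.30] * 10
alarm_day = cusum_first_alarm(example_series, baseline_len=20)
```

An alarm day found this way can then be compared with publicly announced model updates before it is annotated as drift rather than as an action-induced shift.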
4
Line 3 — Codebook Iteration
Annotation schema v0.1 was completed on 30 June 2026; the edge-case discussion of three contested patterns (paraphrased citations, partially correct work titles, pseudonym mentions without work reference) is public on GitHub Issues.
- Schema versioning
Codebook v0.1 (initial) → v0.2 scheduled for 30 November 2026 after evaluation of the external-annotator pilot round. Inter-rater agreement (Cohen’s κ) will be reported for the first time in the Q4 report.
in progress
- Edge-case sample
48 edge cases from the 90 days were collected and publicly discussed in the style-sheet annotation appendix; of these, 31 were classified as "correct citation", 12 as "partially correct", 5 as "hallucination".
published
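Since inter-rater agreement will be reported as Cohen’s κ, a minimal sketch of the statistic for two annotators may help. The function name and the example labels are hypothetical; the labels merely reuse the three edge-case categories named above and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty item list")
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only (not study data).
lead = ["correct citation", "correct citation", "partially correct", "hallucination"]
external = ["correct citation", "partially correct", "partially correct", "hallucination"]
kappa = cohens_kappa(lead, external)
```

With more than two annotators, κ would be computed per rater pair or replaced by a multi-rater statistic; that choice is not specified here.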
5
Discussion and Limitations
Phase 1 is half complete after 90 days, with the book launch 45 days away. The API-based measurement surfaces show the expected deterministic reliability; the language-model probes (Gemini, Claude) are too variable in their current form for pre-registered effect studies. Through the July model update, Bing AI has become a drift-demonstration surface: instructive for the methodology, but also a sign that Phase-2 post-launch effect detection needs tight CUSUM watchers.
Limitations: (i) single-case study without a comparison identity, hence no separation of identity-specific vs. structural effects; (ii) the annotation schema has so far been applied solely by the programme lead, so inter-rater agreement will not be available until Q4; (iii) browser-snapshot probes (Gemini, Claude, Google AI Overviews) depend on platform-UI stability: a UI redesign can break the probe pipeline even though the measurement logic itself is unchanged.
6
Pre-Registrations for Q4 / 2026
Data for three pre-registered validation hypotheses will be collected from 23 September 2026 and evaluated in the January report (Q4 / 2026):
- H-Q4-INST-01
Inter-rater agreement (Cohen’s κ) for codebook v0.2 between the programme lead and two external annotators reaches κ ≥ 0.7 on a sample of N = 200 probe annotations.
- H-Q4-INST-02
Taking n = 5 snapshots per survey day with median aggregation raises the test-retest reliability r of the language-model probes Gemini and Claude above the Phase-2 threshold of 0.7 (power = 0.80 at an expected Δr = 0.12).
- H-Q4-INST-03
CUSUM charts with alarm threshold h = 5 detect AI model updates (Bing AI, Gemini, Claude) in the 90-day window with sensitivity ≥ 0.80 — measured against publicly communicated model-version releases.
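The multi-snapshot aggregation in H-Q4-INST-02 can be sketched as follows, assuming each snapshot yields one numeric score per survey day. The function name, the date keys, and the example scores are hypothetical; only n = 5 and the median come from the hypothesis.

```python
from statistics import median


def aggregate_snapshots(snapshots_by_day, n_required=5):
    """Collapse the repeated snapshots of each survey day into one median score."""
    daily = {}
    for day, scores in snapshots_by_day.items():
        if len(scores) != n_required:
            raise ValueError(
                f"{day}: expected {n_required} snapshots, got {len(scores)}"
            )
        daily[day] = median(scores)
    return daily


# Illustrative scores only (not study data).
example = {
    "2026-10-01": [0.4, 0.6, 0.5, 0.9, 0.5],
    "2026-10-02": [0.7, 0.2, 0.6, 0.6, 0.8],
}
daily_scores = aggregate_snapshots(example)
```

The median is less sensitive to a single outlying snapshot than the mean, which is the motivation for using it with highly variable probes; test-retest reliability would then be computed on the aggregated daily scores.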
7
Open Materials
The final report will be accompanied by a replication archive (Zenodo DOI) containing: all 13,440 daily samples plus 1,680 replication samples as versioned JSON snapshots, all validation and aggregation scripts with pinned dependencies (environment.yml), pre-registration documents in OSF format, codebook v0.1 as a snapshot, and the style-sheet as versioned at the time of the survey.
Raw data under CC0 (where platform terms permit). Source code under the MIT licence at github.com/marintkael/marin-research-tools.
Citation (planned form): Kael, M. T. (2026). Active Pre-Launch Phase — First 90 Days Q0–Q5 + Parallel Measurement Apparatus. Activity Report Q3 / 2026, Marin T. Kael — AI Citation Behaviour Lab. DOI on publication on 15 October 2026.