Research + design proposal · cortextOS / SiteSmith · 2026-06-11 · research only — nothing implemented; every change below awaits your decision · built from 11 research agents + 2 Codex cross-checks + live-code verification
The "three sales decks, got the month-old one" problem is not a storage or "too much old data" problem — it's a retrieval bug: the system ranks by topic-similarity only, has no idea which version is current, and a hidden de-duplicator can silently drop the newest copy. I verified all of this in the live code. The "trash can for dead data" you described is a real and worthwhile second layer — fully designed below — but it should come after the ranking fix, because on an 8-agent system an auto-delete engine creates more review work than it saves.
You framed this as a data-hoarding / lifecycle problem: too much stale stuff, need a relevance-decay upgrade and a review-to-delete "trash can." We researched exactly that (how data centers, RAG systems, and records-management products do it — section 7). But when the research agents went into the actual cortextOS retrieval code to ground the design, they found the headline symptom (the deck miss) is caused upstream of any lifecycle system:
So the work splits cleanly into two layers: (A) fix retrieval so "latest/current X" is correct and the newest version isn't silently dropped — small, high-leverage, fixes your actual complaint; and (B) the broader stale-content lifecycle + reviewable trash can — valuable, fully designed, but phased after (A).
The research agents read the live retrieval path (knowledge-base/scripts/mmrag.py + its three callers) and reproduced the failure. I independently re-verified every code claim below against the actual file — line numbers and behavior confirmed.
cmd_query (mmrag.py:1153) orders results purely by cosine similarity — confirmed by reading the code: there is no recency or canonical term anywhere in the ranking (a live query came back ordered by descending similarity, best matches in the ~0.66–0.68 range). "Most recent" is never computed.
deduplicate_results (mmrag.py:1128, >0.85 word-overlap on the first 500 chars) collapses near-duplicate results and keeps results[0] = the highest-semantic member, and it runs before the top-K trim. In a controlled test, three boilerplate-heavy "decks" (overlap 1.0) collapsed to a single survivor chosen by wording score — the newest was discarded inside Python before the agent ever saw it. (Note: it keeps the highest-semantic, which is recency-blind and can drop the newest — not literally "always the oldest." Also: 0.85 overlap on the first 500 chars can collapse monthly decks that differ only in later slides.)
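The collapse described above can be reproduced in a few lines. This is a minimal sketch, not the code in mmrag.py: the overlap measure (Jaccard on word sets), the function names, and the sample decks are all assumptions made for illustration. It only shows why "keep the first member of each near-duplicate group" is recency-blind when the input is sorted by similarity.

```python
import re

def word_overlap(a: str, b: str, prefix: int = 500) -> float:
    """Jaccard overlap of the word sets in the first `prefix` characters (assumed measure)."""
    wa = set(re.findall(r"\w+", a[:prefix].lower()))
    wb = set(re.findall(r"\w+", b[:prefix].lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def dedup_keep_first(results: list[dict], threshold: float = 0.85) -> list[dict]:
    """Keep the first member of each near-duplicate group, in input order."""
    kept: list[dict] = []
    for r in results:
        if all(word_overlap(r["text"], k["text"]) <= threshold for k in kept):
            kept.append(r)
    return kept

# Hypothetical decks: identical opening boilerplate, differences only later on.
boilerplate = "Acme sales deck overview pricing roadmap contact " * 20
results = [  # already sorted by descending similarity, as cmd_query returns them
    {"source": "deck-2026-04.md", "similarity": 0.68, "text": boilerplate + "April numbers"},
    {"source": "deck-2026-06.md", "similarity": 0.67, "text": boilerplate + "June numbers"},
    {"source": "deck-2026-05.md", "similarity": 0.66, "text": boilerplate + "May numbers"},
]
survivors = dedup_keep_first(results)
print([r["source"] for r in survivors])
```

With these inputs the only survivor is the April deck: it wins on wording score, and the June deck is dropped before any caller sees it.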
All three callers apply a 0.5 similarity threshold before any rerank could recover a result. The live best legitimate match scored only 0.66–0.68 — little headroom — so a correct-but-slightly-off-topic newer version can be filtered out entirely first.
file_id = md5(path)[:12] + _chunk{i}, mmrag.py:489). "Most recent deck" is a document-level question, so ranking must group chunks → source before applying recency, or one strong old chunk beats the right new document. (Codex flagged this as the actual crux.)
ingested_at exists on every chunk, but it records re-ingest time: re-ingesting a month-old deck yesterday stamps it "fresh." Decay must use created_at from frontmatter or git, never ingest-time (this is the same trap as LangChain/LlamaIndex's access-time decay rerankers).
Codex split this into the minimal core that fixes your complaint vs. the operational layer that's nice-to-have. Reflected below.
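Returning to the chunk-versus-document point: a minimal sketch of grouping chunk hits by source before asking which document is newest. The `source`, `similarity`, and `created_at` keys and the sample hits are hypothetical, and picking the best chunk per source is only one possible aggregation.

```python
def group_by_source(chunks: list[dict]) -> list[dict]:
    """Collapse chunk hits to one record per source document, keeping its best chunk."""
    best: dict[str, dict] = {}
    for c in chunks:
        src = c["source"]
        if src not in best or c["similarity"] > best[src]["similarity"]:
            best[src] = c
    return sorted(best.values(), key=lambda c: c["similarity"], reverse=True)

# Hypothetical hits: the old deck has the single strongest chunk.
chunks = [
    {"id": "a_chunk3", "source": "deck-2026-04.md", "similarity": 0.68, "created_at": "2026-04-10"},
    {"id": "a_chunk0", "source": "deck-2026-04.md", "similarity": 0.61, "created_at": "2026-04-10"},
    {"id": "b_chunk1", "source": "deck-2026-06.md", "similarity": 0.66, "created_at": "2026-06-08"},
]
docs = group_by_source(chunks)
# ISO dates compare correctly as strings, so max() picks the newest document.
newest = max(docs, key=lambda d: d["created_at"])
print(newest["source"])
```

Here recency is compared across two documents rather than three chunks, so the June deck wins even though the April deck owns the highest-scoring chunk.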
cmd_query between the threshold filter and the de-duplicator: blended = α·similarity + β·recency_decay(created_at) + γ·canonical_boost → re-sort → then dedup (now keeps the freshest/canonical as results[0]) → then trim. This is the only location that works, because the TypeScript layer only ever sees the already-trimmed five.
Step 0 before any of this: re-run the deck query against your real corpus at high top-K and threshold 0 to confirm which mode fires for your docs, and commit that as the regression baseline.
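The blended score can be sketched as follows. Everything numeric here is an assumption for illustration: the weights, the 90-day half-life, the exponential form of recency_decay, and the choice of 0.5 as the "neutral" value for a missing created_at. The hit dictionaries and their keys are hypothetical too.

```python
from datetime import date

def recency_decay(created_at, today, half_life_days=90.0):
    """Exponential decay in (0, 1]; a missing date scores a neutral 0.5, never ingest-time."""
    if created_at is None:
        return 0.5
    age_days = max((today - created_at).days, 0)
    return 0.5 ** (age_days / half_life_days)

def blended(hit, today, alpha=0.7, beta=0.2, gamma=0.1):
    """alpha * similarity + beta * recency_decay(created_at) + gamma * canonical_boost."""
    canonical_boost = 1.0 if hit.get("is_canonical") else 0.0
    return (alpha * hit["similarity"]
            + beta * recency_decay(hit.get("created_at"), today)
            + gamma * canonical_boost)

def rerank(hits, today, threshold=0.5):
    """Threshold filter first, then re-sort by blended score (dedup and trim would follow)."""
    passed = [h for h in hits if h["similarity"] >= threshold]
    return sorted(passed, key=lambda h: blended(h, today), reverse=True)

today = date(2026, 6, 11)
hits = [
    {"source": "deck-2026-04.md", "similarity": 0.68, "created_at": date(2026, 4, 10)},
    {"source": "deck-2026-06.md", "similarity": 0.66, "created_at": date(2026, 6, 8)},
    {"source": "deck-undated.md", "similarity": 0.67, "created_at": None},
]
ranked = rerank(hits, today)
print([h["source"] for h in ranked])
```

In this example the June deck moves to the front on recency alone. The undated deck is neither rewarded nor buried; surfacing that neutral fallback in an explain output is what keeps missing dates visible.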
set-meta-by-source primitive (doesn't exist): update every chunk of a source's metadata in place, with a dry-run + exact chunk-count audit (partial metadata = worse ranking). Prerequisite for both canonical tagging and any future soft-delete.
doc_family, version, status, is_canonical, current, supersedes/superseded_by, created_at, effective_date, pinned. Note canonical ≠ current (an evergreen canonical doc is not the same as "newest deck"). Keep the vocabulary small: no enterprise PHI/PCI taxonomies.
is_canonical, predecessors superseded. Route ambiguous families to review, never auto-supersede.
--type) so callers can pass metadata filters: a smaller build than "from scratch."
Designed in full below, but do not pre-build it. On a low-query 8-agent brain where git already gives free, complete undo, an automated delete-flagging engine manufactures review burden and a delete-risk surface to solve what is, today, a ranking miss. Build it only when real staleness materializes beyond what V1 fixes.
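For the set-meta-by-source primitive proposed above, one possible shape is sketched here. The function name, the `source` metadata key, and the chunk ids are hypothetical. The `get(where=..., include=...)` and `update(ids=..., metadatas=...)` calls mirror Chroma's Collection methods as I understand them, which should be checked against the installed version; the in-memory stand-in exists only so the sketch runs without a database.

```python
def set_meta_by_source(collection, source, patch, expected_chunks, dry_run=True):
    """Apply `patch` to every chunk of `source`, or raise if the chunk count is off."""
    found = collection.get(where={"source": source}, include=["metadatas"])
    ids, metas = found["ids"], found["metadatas"]
    if len(ids) != expected_chunks:
        # Refuse partial updates: partially tagged sources rank worse than untagged ones.
        raise ValueError(f"{source}: expected {expected_chunks} chunks, found {len(ids)}")
    updated = [{**(m or {}), **patch} for m in metas]
    if not dry_run:
        collection.update(ids=ids, metadatas=updated)
    return {"source": source, "chunks": len(ids), "applied": not dry_run}

class FakeCollection:
    """In-memory stand-in exposing only get/update, for demonstration."""
    def __init__(self, rows):
        self.rows = dict(rows)
    def get(self, where, include=None):
        key, value = next(iter(where.items()))
        ids = [i for i, m in self.rows.items() if m.get(key) == value]
        return {"ids": ids, "metadatas": [self.rows[i] for i in ids]}
    def update(self, ids, metadatas):
        for i, m in zip(ids, metadatas):
            self.rows[i] = m

col = FakeCollection({
    "ab12_chunk0": {"source": "decks/sales-2026-06.md"},
    "ab12_chunk1": {"source": "decks/sales-2026-06.md"},
    "cd34_chunk0": {"source": "decks/sales-2026-04.md"},
})
preview = set_meta_by_source(col, "decks/sales-2026-06.md", {"is_canonical": True}, expected_chunks=2)
applied = set_meta_by_source(col, "decks/sales-2026-06.md", {"is_canonical": True},
                             expected_chunks=2, dry_run=False)
print(preview, applied)
```

Defaulting to dry_run=True means a caller has to opt in to writes, and the exact-count check turns a silent partial update into a loud failure.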
This is the system you described. The research is unanimous on the shape, and your substrate gives most of it for free:
git rm'd content lives in history (restorable from any earlier commit; reflog covers commits that have dropped off a branch). No new storage infra.
This is your hard requirement, and the literature agrees on the method: fuse several weak, independent signals into one score; flag only when they agree; route everything ambiguous to human review; never auto-delete.
| Signal | Strength & honest caveat |
|---|---|
| Deterministic supersession (newer version / later effective-date / git timestamp / filename pattern like -v3, -2026-06) | The anchor. Objective, primary-source. Strong enough to stage on its own. This alone solves the three-decks case. |
| Near-duplicate embedding cluster (cosine on the vectors already in Chroma) | Confirms supersession; never used alone (two clients' decks cluster but are distinct). At 6,900 docs brute-force is trivial — skip MinHash/LSH. |
| Cross-reference orphan (nothing links to it) | Conditional: only meaningful if brain files have machine-parseable links. If they're freeform prose, this is unreliable — must validate first. |
| Per-content-type recency decay (event docs expire; core facts never) | On created_at/effective-date, never ingest-time. An old "this cycle's goals" is legitimately expired → flag-for-archive, not delete. |
| Retrieval hit-rate (never returned/used) | Weak here. On a low-traffic 8-agent system, zero hits is mostly low query volume, not death. Long window, corroborator only, never alone. |
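The filename-pattern part of the first row can be sketched as below, assuming only the two suffix styles named in the table (-v3 and -2026-06). The regexes, function names, and sample paths are hypothetical. Files with no recognized suffix are skipped rather than guessed at.

```python
import re
from pathlib import PurePosixPath

VERSION = re.compile(r"-v(\d+)$")          # e.g. pricing-v3
MONTH = re.compile(r"-(\d{4})-(\d{2})$")   # e.g. sales-deck-2026-06

def family_and_order(path: str):
    """Return ((family, kind), sort_key), or None if the stem has no recognized suffix."""
    stem = PurePosixPath(path).stem
    m = VERSION.search(stem)
    if m:
        return (stem[:m.start()], "version"), (int(m.group(1)),)
    m = MONTH.search(stem)
    if m:
        return (stem[:m.start()], "date"), (int(m.group(1)), int(m.group(2)))
    return None

def newest_per_family(paths):
    """Pick the highest-ordered path in each family; unparseable names are left out."""
    newest = {}
    for p in paths:
        parsed = family_and_order(p)
        if parsed is None:
            continue  # no deterministic signal: leave for human review
        family, order = parsed
        if family not in newest or order > newest[family][0]:
            newest[family] = (order, p)
    return {family: p for family, (order, p) in newest.items()}

paths = [
    "decks/sales-deck-2026-04.md", "decks/sales-deck-2026-05.md",
    "decks/sales-deck-2026-06.md", "decks/pricing-v2.md",
    "decks/pricing-v10.md", "notes/readme.md",
]
current = newest_per_family(paths)
print(current)
```

Versions are compared as integers, so v10 outranks v2. Version-style and date-style names are kept in separate families so the two orderings are never mixed.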
The "not a guess" bar, corrected: require agreement of genuinely orthogonal signals. The tempting pair "superseded AND orphaned" is not orthogonal — superseding a doc causes its orphaning (references get re-pointed), so they're correlated by common cause. Anchor on deterministic supersession; fuzzy/embedding-only cases need a real confirmer or go to review.
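The routing rule described here can be sketched as a small decision function. The field names, the outcome labels, and the sample documents are hypothetical. The order of checks follows the text: a pin always wins, deterministic supersession is enough to stage, type-based expiry flags for archive, and fuzzy evidence alone goes to a human. No branch deletes anything.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    superseded: bool                  # deterministic: a newer version of the same family exists
    looks_like_older_duplicate: bool  # embedding cluster only, no version/date evidence
    expired_by_type: bool             # per-content-type decay on created_at / effective date
    orphaned: bool                    # correlated with supersession, so not a confirmer
    pinned: bool = False              # keep-forever override

def triage(s: Signals) -> str:
    if s.pinned:
        return "keep"
    if s.superseded:
        return "stage"             # anchor signal, strong enough on its own
    if s.expired_by_type:
        return "flag-for-archive"  # legitimately expired, still not a delete
    if s.looks_like_older_duplicate or s.orphaned:
        return "review"            # fuzzy evidence only: a human decides
    return "keep"

docs = {
    "deck-2026-04.md": Signals(True, True, False, True),
    "deck-2026-05.md": Signals(True, True, False, False),
    "deck-2026-06.md": Signals(False, False, False, False),
    "client-b-deck.md": Signals(False, True, False, False),
    "cycle-goals-2025.md": Signals(False, False, True, False),
}
decisions = {name: triage(s) for name, s in docs.items()}
print(decisions)
```

The second client's deck clusters with the others but has no deterministic evidence against it, so it lands in review instead of being staged.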
Date provenance. If a doc's created_at is missing, copied, rebased, or actually edit/import time rather than the deck's real date, V1 returns the wrong "latest" with more confidence than today. Approve V1 only if date provenance is auditable and missing dates are visible in the explain output (degrade to neutral, never fall back to ingest-time).
gemini-embedding-2-preview) that will be deprecated: plan a full one-shot re-embed as a "when," not "if."
.md frontmatter (git-diffable, survives re-ingest) and mirror into Chroma, or only in Chroma?
mmrag.py owned locally (safe to extend) or pulled from upstream cortextOS (then changes go through the upstream-delivery path)?
Four research angles, each web-sourced and adversarially verified. The honest throughline: most hyperscale machinery is a cost tradeoff that doesn't exist on one local box, but the lifecycle and review patterns transfer directly.
Hot/warm/cold/archive tiering (S3 Glacier, Azure tiers) is fundamentally about saving money on rarely-touched bytes across storage with different latencies — none of that economics exists locally, where disk + Chroma are uniformly fast. What transfers, recast for relevance not cost: (1) the lifecycle state machine (active → stale-candidate → staged → deleted); (2) classification/tagging — the direct fix for "which deck is current"; (3) lifecycle rules with the expire action replaced by "stage for review" — exactly your review-before-delete constraint; (4) legal-hold/WORM inverted into a pin flag for canonical docs (git = the free compliance/recovery backstop); (5) Microsoft Purview disposition review as the near-exact blueprint for the reviewable trash can. Verified caveat: Purview offers an optional auto-approval timeout — we deliberately choose the stricter "human approves every purge" setting. Skip DoD 5015.02 / ISO 15489 / Glacier / SEC-WORM as overkill for 8 agents.
Re-centered on your real pain: this is supersession/canonical-version, not age. A month-old doc isn't "old." Critical finding: the off-the-shelf recency rerankers (LangChain TimeWeightedVectorStoreRetriever, LlamaIndex TimeWeightedPostprocessor) decay on last-accessed time and every retrieval refreshes it — so a stale-but-popular deck stays permanently "fresh" and the wrong answer gets entrenched. Decay must run on created_at/effective-date. Build the trash can from verified Chroma mechanics (in-place metadata update, where-filter exclusion). Add a staleness regression test — standard RAG eval (RAGAS) rates a stale-but-faithful answer 0.95 and won't catch this. Right-size: brute-force dedup beats LSH at 6,900 docs.
Separates cleanly into detection (what's dead — your "by signals") vs staging (reviewable trash + reversibility). Every enterprise system researched (Google Vault, M365 retention, Purview, S3 versioning, Notion/Confluence archive) is staging machinery and is time/age-triggered — a retention clock would expire all three decks by elapsed time and never notice deck #3 supersedes #1 and #2. So borrow the safe staging/review/audit lifecycle; build the supersession detector ourselves. Your substrate gives two of three staging tiers free: git (tombstone + audit + undo) and a Chroma "archived" flag (out of active retrieval, recoverable).
Confidence comes from fusing multiple weak signals + a confidence "gray zone" routed to human review (the Dosu / temporal-RAG / fraud-screening pattern). Verified reframe of the three decks: embedding-cluster the decks as one family, pick newest by version/date as canonical, stage the older two. False-positive control is the deliverable (flagging something still-needed is the failure you fear most): per-content-type TTLs, a multi-signal threshold, a gray-zone review queue, and a per-doc "keep-forever" override. Verifier corrections folded in: retrieval-hit-rate downgraded to weak-corroborator at this scale; "N≥2 signals" must be genuinely orthogonal, not two correlated views of "oldness."
Lifecycle / tiering / records:
splunk.com/.../ilm-information-lifecycle-management · docs.aws.amazon.com S3 storage-class-intro / object-lifecycle-mgmt / object-lock-overview / tagging-best-practices · learn.microsoft.com Azure lifecycle-management + access-tiers · cloud.google.com storage-classes + lifecycle · learn.microsoft.com/purview/disposition · esd.whs.mil DoD 5015.02 · encompaas.cloud + zasio.com defensible-disposition
RAG hygiene / decay:
arxiv.org/abs/2510.08109 (Uber Eats semantic search) · milvus.io exponential-decay · python.langchain.com time_weighted_vectorstore + github issue 29306 · docs.llamaindex.ai TimeWeightedPostprocessor · docs.trychroma.com collections/update + filtering · pinecone.io update-delete-by-metadata · arxiv.org/pdf/2503.04800 (HoH benchmark) · medium.com embedding-model-upgrade-cost · atlan.com evaluate-rag · towardsdatascience.com rag-is-blind-to-time
Soft-delete / review:
jamestharpe.com/tombstone-pattern · streamkap.com cdc-soft-deletes · docs.aws.amazon.com DeleteMarker · support.google.com/vault · learn.microsoft.com/purview retention + disposition + preservation-lock · notion.com duplicate-delete-and-restore + custom-data-retention · github.com/chroma-core/chroma issue 3793 · git-tower.com restoring-deleted-files
Staleness detection:
github.com/Emmimal/temporal-rag · dosu.dev score-documentation-freshness · fiberplane.com drift-documentation-linter · arxiv.org 2511.12979 (RAGPulse) / 2502.15734 (Cache-Craft) / 2503.04800 (HoH) / 2509.19376 · sardine.ai + flagright.com false-positive control · docs.ragie.ai recency-bias
Full URLs with per-claim attribution are in the working file workspace/research/research-basis.md. Several sources were down-weighted or corrected during adversarial verification (e.g., a refuted "5%/60% hot-chunk" statistic, a dead LangChain v0.2 deep-link, Uber cadence re-sourced to primary) — those corrections are recorded.
results[0], the 8 subcommands with no update primitive, ingested_at as ingest-time, the preview embedding model) was re-checked against the actual source; all confirmed.
Honesty notes: this is research/design only; nothing was changed. The deck bug is verified; the broader trash-can system is designed but recommended for phase 2. Where the original research over-reached (hyperscale machinery, correlated "independent" signals, an assumed metadata-update primitive that doesn't exist), the verification + critique caught and corrected it, and those corrections are reflected here rather than hidden.
workspace/research/ (final-design.json, research-basis.md, findings-digest.md).