Data Lifecycle & Stale-Content System

Research + design proposal  ·  cortextOS / SiteSmith  ·  2026-06-11  ·  research only — nothing implemented; every change below awaits your decision  ·  built from 11 research agents + 2 Codex cross-checks + live-code verification

The one-line answer

The "three sales decks, got the month-old one" problem is not a storage or "too much old data" problem — it's a retrieval bug: the system ranks by topic-similarity only, has no idea which version is current, and a hidden de-duplicator can silently drop the newest copy. I verified all of this in the live code. The "trash can for dead data" you described is a real and worthwhile second layer — fully designed below — but it should come after the ranking fix, because on an 8-agent system an auto-delete engine creates more review work than it saves.


1. What you asked vs. what we found

You framed this as a data-hoarding / lifecycle problem: too much stale stuff, need a relevance-decay upgrade and a review-to-delete "trash can." We researched exactly that (how data centers, RAG systems, and records-management products do it; see section 7). But when the research agents went into the actual cortextOS retrieval code to ground the design, they found that the headline symptom (the deck miss) is caused upstream of any lifecycle system, in the retrieval path itself (section 2).

So the work splits cleanly into two layers: (A) fix retrieval so "latest/current X" is correct and the newest version isn't silently dropped — small, high-leverage, fixes your actual complaint; and (B) the broader stale-content lifecycle + reviewable trash can — valuable, fully designed, but phased after (A).

2. Root cause — three recency-blind failure modes verified in live code

The research agents read the live retrieval path (knowledge-base/scripts/mmrag.py + its three callers) and reproduced the failure. I independently re-verified every code claim below against the actual file — line numbers and behavior confirmed.

① Ranking is similarity-only — no recency, no "which is current"

cmd_query (mmrag.py:1153) orders results purely by cosine similarity — confirmed by reading the code: there is no recency or canonical term anywhere in the ranking (a live query came back ordered by descending similarity, best matches in the ~0.66–0.68 range). "Most recent" is never computed.

② The de-duplicator keeps the highest-similarity copy, not the newest — and can drop the others before you see them

deduplicate_results (mmrag.py:1128) collapses near-duplicate results, defined as more than 0.85 word overlap on the first 500 characters. It keeps results[0], the member with the highest similarity score, and it runs before the top-K trim. In a controlled test, three boilerplate-heavy "decks" (overlap 1.0) collapsed to a single survivor chosen by wording score; the newest was discarded inside Python before the agent ever saw it. Two precisions: the survivor is the highest-similarity copy, which is recency-blind and can drop the newest, but it is not literally "always the oldest"; and a 0.85 overlap on the first 500 characters can collapse monthly decks that differ only in later slides.
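To make this failure mode concrete, here is a minimal sketch of a recency-blind de-duplicator next to a recency-aware variant. It is not the code in mmrag.py: the overlap metric (Jaccard on word sets), the result shape (dicts with text, similarity, created_at), and both function names are assumptions made for illustration.

```python
import re

def word_overlap(a: str, b: str, prefix: int = 500) -> float:
    """Jaccard overlap of the word sets in the first `prefix` characters (assumed metric)."""
    wa = set(re.findall(r"\w+", a[:prefix].lower()))
    wb = set(re.findall(r"\w+", b[:prefix].lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def dedup_keep_first(results, threshold=0.85):
    """Recency-blind: the first (highest-similarity) member of each group survives."""
    kept = []
    for r in results:
        if all(word_overlap(r["text"], k["text"]) <= threshold for k in kept):
            kept.append(r)
    return kept

def dedup_keep_newest(results, threshold=0.85):
    """Same grouping, but the survivor of each group is the latest created_at."""
    groups = []
    for r in results:
        for g in groups:
            if word_overlap(r["text"], g[0]["text"]) > threshold:
                g.append(r)
                break
        else:
            groups.append([r])
    # ISO dates sort lexicographically; a missing date sorts first, so it never wins on recency.
    return [max(g, key=lambda r: r.get("created_at") or "") for g in groups]

# Three decks that share long boilerplate and differ only after the first 500 characters.
boilerplate = "Acme quarterly sales deck overview and agenda " * 20
results = [  # already sorted by descending similarity, as a query would return them
    {"id": "deck-april", "similarity": 0.68, "created_at": "2026-04-01", "text": boilerplate + "April pricing"},
    {"id": "deck-may", "similarity": 0.67, "created_at": "2026-05-01", "text": boilerplate + "May pricing"},
    {"id": "deck-june", "similarity": 0.66, "created_at": "2026-06-01", "text": boilerplate + "June pricing"},
]

first_ids = [r["id"] for r in dedup_keep_first(results)]    # only deck-april survives
newest_ids = [r["id"] for r in dedup_keep_newest(results)]  # only deck-june survives
```

The grouping is identical in both functions; only the choice of survivor changes. That is why this is a small fix rather than a redesign.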

③ A 0.5 similarity cutoff runs before any re-ranking

All three callers apply a 0.5 similarity threshold before any rerank could recover a result. The live best legitimate match scored only 0.66–0.68 — little headroom — so a correct-but-slightly-off-topic newer version can be filtered out entirely first.

Two more verified facts that shape the fix

3. The fix — phased

Codex split this into the minimal core that fixes your complaint vs. the operational layer that's nice-to-have. Reflected below.

V1 (core): Make "latest/current X" a ranking outcome, not an accident

Step 0 before any of this: re-run the deck query against your real corpus at high top-K and threshold 0 to confirm which mode fires for your docs, and commit that as the regression baseline.
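Step 0 can be turned into a repeatable check. The sketch below classifies which of the three failure modes removes the expected document, given raw results fetched at threshold 0 and a high top-K. The function name, the result shape, and the pluggable dedup callable are hypothetical; it does not call mmrag.py.

```python
def diagnose(raw, expected_id, threshold=0.5, top_k=5, dedup=lambda rs: rs):
    """Explain why expected_id is, or is not, in the final top-K.

    raw: results fetched at threshold 0 and high top-K, sorted by descending
    similarity, each a dict with at least "id" and "similarity".
    dedup: stand-in for the de-duplication step (identity by default).
    """
    if expected_id not in [r["id"] for r in raw]:
        return "not_retrieved"           # never a candidate at all
    passed = [r for r in raw if r["similarity"] >= threshold]
    if expected_id not in [r["id"] for r in passed]:
        return "cut_by_threshold"        # failure mode 3
    survivors = dedup(passed)
    if expected_id not in [r["id"] for r in survivors]:
        return "dropped_by_dedup"        # failure mode 2
    if expected_id not in [r["id"] for r in survivors[:top_k]]:
        return "outranked"               # failure mode 1
    return "ok"

raw = [
    {"id": "deck-apr", "similarity": 0.68},
    {"id": "deck-may", "similarity": 0.67},
    {"id": "deck-jun", "similarity": 0.48},
]
mode = diagnose(raw, "deck-jun")  # the newest deck falls below the 0.5 cutoff
```

Committing the raw results and the returned mode alongside the query gives the regression baseline something concrete to compare against after V1 lands.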

V1.5 — prerequisites & one-time migration

V2 (deferred): The automated staleness detector + reviewable trash can (section 4)

Designed in full below, but do not pre-build it. On a low-query 8-agent brain where git already gives free, complete undo, an automated delete-flagging engine manufactures review burden and a delete-risk surface to solve what is, today, a ranking miss. Build it only when real staleness materializes beyond what V1 fixes.

4. The reviewable "trash can" — designed, recommended for phase 2

This is the system you described. The research is unanimous on the shape, and your substrate gives most of it for free:

detector flags a candidate (WITH evidence, multi-signal)
   |
   v
SOFT-DELETE: set status=staged_for_review on every chunk (via set-meta)
   |  -> excluded from active retrieval by a where-filter
   |  -> the .md file stays on disk + in git, untouched
   v
REVIEW MANIFEST (git-tracked): each item + its evidence + proposed replacement
   |
   v
YOU review (one queue, one reviewer), MULTI-ACTION per item:
   Approve-purge / Keep / Archive / Supersede-link / Extend
   |
   v
only an Approve-purge -> deliberate git rm in one reviewed commit
   (still recoverable from git history; no timer ever deletes)
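The review step can be sketched as a small pure function whose one hard invariant comes straight from the flow above: only an explicit Approve-purge produces a path to remove, and an undecided item is never touched. The class names, the manifest fields, and the status each action maps to are assumptions for illustration; the proposal does not define them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Action(Enum):
    APPROVE_PURGE = "approve-purge"
    KEEP = "keep"
    ARCHIVE = "archive"
    SUPERSEDE_LINK = "supersede-link"
    EXTEND = "extend"

@dataclass
class ManifestItem:
    path: str
    evidence: list                       # why the detector flagged this item
    proposed_replacement: Optional[str] = None
    status: str = "staged_for_review"

# Assumed status after each action; the real mapping is a design decision.
NEXT_STATUS = {
    Action.APPROVE_PURGE: "purge_approved",
    Action.KEEP: "active",
    Action.ARCHIVE: "archived",
    Action.SUPERSEDE_LINK: "superseded",
    Action.EXTEND: "staged_for_review",  # stays in the queue for a later look
}

def apply_review(items, decisions):
    """Apply reviewer decisions; return the paths a human may now `git rm`."""
    to_remove = []
    for item in items:
        action = decisions.get(item.path)
        if action is None:
            continue  # no timer and no default: undecided items stay staged
        if action is Action.SUPERSEDE_LINK and not item.proposed_replacement:
            raise ValueError(f"{item.path}: supersede-link needs a replacement")
        item.status = NEXT_STATUS[action]
        if action is Action.APPROVE_PURGE:
            to_remove.append(item.path)
    return to_remove

items = [
    ManifestItem("decks/deck-apr.md", ["superseded by newer version"], "decks/deck-jun.md"),
    ManifestItem("decks/deck-may.md", ["superseded by newer version"], "decks/deck-jun.md"),
    ManifestItem("notes/cycle-goals.md", ["expired by content type"]),
]
decisions = {
    "decks/deck-apr.md": Action.APPROVE_PURGE,
    "decks/deck-may.md": Action.SUPERSEDE_LINK,
}
purge_list = apply_review(items, decisions)
```

The function never deletes anything itself. It only returns the list for the single reviewed commit, which keeps the deletion step deliberate and in your hands.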

5. Confidently identifying dead content — "by signals, not a guess"

This is your hard requirement, and the literature agrees on the method: fuse several weak, independent signals into one score; flag only when they agree; route everything ambiguous to human review; never auto-delete.

Signal | Strength & honest caveat
Deterministic supersession (newer version / later effective-date / git timestamp / filename pattern like -v3, -2026-06) | The anchor. Objective, primary-source. Strong enough to stage on its own. This alone solves the three-decks case.
Near-duplicate embedding cluster (cosine on the vectors already in Chroma) | Confirms supersession; never used alone (two clients' decks cluster but are distinct). At 6,900 docs brute-force is trivial, so skip MinHash/LSH.
Cross-reference orphan (nothing links to it) | Conditional: only meaningful if brain files have machine-parseable links. If they're freeform prose, this is unreliable; must validate first.
Per-content-type recency decay (event docs expire; core facts never) | On created_at/effective-date, never ingest-time. An old "this cycle's goals" is legitimately expired → flag-for-archive, not delete.
Retrieval hit-rate (never returned/used) | Weak here. On a low-traffic 8-agent system, zero hits is mostly low query volume, not death. Long window, corroborator only, never alone.
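The filename-pattern part of deterministic supersession is simple enough to sketch. The code below handles only the two suffix shapes named in the table (-v3 and -2026-06, plus an optional day); the function names, the regexes, and the choice to group a family by directory plus stem are assumptions, and git timestamps and effective dates are out of scope here.

```python
import re
from pathlib import PurePosixPath

_DATE = re.compile(r"-(\d{4})-(\d{2})(?:-(\d{2}))?$")  # -2026-06 or -2026-06-11
_VERSION = re.compile(r"-v(\d+)$")                     # -v3

def parse_name(filename):
    """Return (family, sort_key); sort_key is None when no pattern matches."""
    path = PurePosixPath(filename)
    stem = path.stem
    m = _DATE.search(stem)
    if m:
        year, month, day = int(m[1]), int(m[2]), int(m[3] or 1)
        if 1 <= month <= 12 and 1 <= day <= 31:
            return str(path.parent / stem[:m.start()]), (year, month, day)
    m = _VERSION.search(stem)
    if m:
        return str(path.parent / stem[:m.start()]), (int(m[1]),)
    return str(path.parent / stem), None

def superseded(filenames):
    """Map each older file to the newest file of the same family and pattern kind."""
    families = {}
    for name in filenames:
        family, key = parse_name(name)
        if key is not None:
            # len(key) keeps date-suffixed and version-suffixed files apart
            families.setdefault((family, len(key)), []).append((key, name))
    result = {}
    for members in families.values():
        members.sort()
        newest_key, newest_name = members[-1]
        for key, name in members:
            if key < newest_key:  # equal keys are never treated as superseded
                result[name] = newest_name
    return result

pairs = superseded([
    "decks/acme-deck-2026-04.md",
    "decks/acme-deck-2026-05.md",
    "decks/acme-deck-2026-06.md",
    "notes/pricing-v2.md",
    "notes/pricing-v10.md",
    "readme.md",
])
```

Versions compare numerically, so -v10 correctly outranks -v2. Files with no recognizable suffix are left alone rather than guessed at.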

The "not a guess" bar, corrected: require agreement of genuinely orthogonal signals. The tempting pair "superseded AND orphaned" is not orthogonal — superseding a doc causes its orphaning (references get re-pointed), so they're correlated by common cause. Anchor on deterministic supersession; fuzzy/embedding-only cases need a real confirmer or go to review.
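One way to encode that bar is a decision function in which deterministic supersession is the only signal that can stage on its own, and fuzzy signals can at most open a review. The field names, the outcome labels, and the split between "stage_for_review" (soft-deleted pending your approval) and "gray_zone_review" (stays active until you look) are assumptions for this sketch.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Signals:
    deterministic_supersession: bool = False
    near_duplicate_cluster: bool = False
    orphaned: bool = False
    expired_by_type: bool = False
    zero_hits: bool = False

def decide(s: Signals):
    """Return (decision, evidence). No branch ever deletes anything."""
    evidence = [name for name, fired in asdict(s).items() if fired]
    if s.deterministic_supersession:
        decision = "stage_for_review"   # the anchor: strong enough on its own
    elif s.expired_by_type:
        decision = "flag_for_archive"   # legitimately expired, not dead
    elif s.near_duplicate_cluster or s.orphaned:
        decision = "gray_zone_review"   # fuzzy evidence only: a human decides
    else:
        decision = "no_action"          # zero hits alone is never enough
    return decision, evidence

decision, evidence = decide(Signals(deterministic_supersession=True, orphaned=True))
```

Note that orphaned adds to the evidence list but never upgrades a decision, which reflects the point above that it is correlated with supersession rather than independent of it.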

6. Risks & the decisions you need to make

Top risk (gates everything)

Date provenance. If a doc's created_at is missing, copied, rebased, or actually edit/import time rather than the deck's real date, V1 returns the wrong "latest" with more confidence than today. Approve V1 only if date provenance is auditable and missing dates are visible in the explain output (degrade to neutral, never fall back to ingest-time).
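A minimal sketch of what "auditable and visible" could look like in a reranker: every result carries the date that was used, where it came from, and the bonus it earned, and a missing or unparseable date earns exactly zero. The metadata keys, the rank-based bonus, and the 0.1 weight are assumptions for illustration, not the V1 design.

```python
from datetime import date

DATE_FIELDS = ("effective_date", "created_at")  # assumed keys, in priority order

def effective_date(meta):
    """Return (date or None, provenance). Ingest time is deliberately never consulted."""
    for field in DATE_FIELDS:
        raw = meta.get(field)
        if raw:
            try:
                return date.fromisoformat(str(raw)[:10]), field
            except ValueError:
                return None, f"unparseable:{field}"
    return None, "missing"

def rerank_with_explain(results, weight=0.1):
    """Add a recency bonus in [-weight, +weight]; undated results get 0.0 (neutral)."""
    resolved = [effective_date(r.get("meta", {})) for r in results]
    known = sorted({d for d, _ in resolved if d is not None})
    rank = {d: i for i, d in enumerate(known)}
    explained = []
    for r, (d, provenance) in zip(results, resolved):
        if d is None or len(known) < 2:
            bonus = 0.0
        else:
            bonus = weight * (2 * rank[d] / (len(known) - 1) - 1)
        explained.append({
            "id": r["id"],
            "similarity": r["similarity"],
            "date": d.isoformat() if d else None,
            "provenance": provenance,
            "bonus": bonus,
            "score": r["similarity"] + bonus,
        })
    return sorted(explained, key=lambda e: e["score"], reverse=True)

ranked = rerank_with_explain([
    {"id": "deck-apr", "similarity": 0.68, "meta": {"created_at": "2026-04-01"}},
    {"id": "deck-may", "similarity": 0.67, "meta": {"created_at": "2026-05-01"}},
    {"id": "deck-jun", "similarity": 0.66, "meta": {"effective_date": "2026-06-01"}},
    {"id": "deck-undated", "similarity": 0.67, "meta": {"ingested_at": "2026-06-10"}},
])
```

In this toy input the June deck moves to the top. The undated deck keeps its raw similarity and shows provenance "missing", so a bad date is something you can see rather than something that silently wins.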

Other risks
Decisions for you (full list in the appendix)
  1. Should doc metadata live authoritatively in .md frontmatter (git-diffable, survives re-ingest) and mirror into Chroma, or only in Chroma?
  2. Is mmrag.py owned locally (safe to extend) or pulled from upstream cortextOS (then changes go through the upstream-delivery path)?
  3. De-duplicator: minimal (rerank-before-dedup) or the more-correct source/family-aware version that preserves "show me all our decks"?
  4. What ChromaDB version is deployed? (Determines whether in-place metadata update and space-reclaiming delete work as assumed.)

7. How top orgs handle this — and what actually transfers

Four research angles, each web-sourced and adversarially verified. The honest throughline: most hyperscale machinery is a cost tradeoff that doesn't exist on one local box — but the lifecycle and review patterns transfer directly.

① Enterprise data lifecycle & tiered storage (AWS/Azure/GCP, Purview, records management)

Hot/warm/cold/archive tiering (S3 Glacier, Azure tiers) is fundamentally about saving money on rarely-touched bytes across storage with different latencies — none of that economics exists locally, where disk + Chroma are uniformly fast. What transfers, recast for relevance not cost: (1) the lifecycle state machine (active → stale-candidate → staged → deleted); (2) classification/tagging — the direct fix for "which deck is current"; (3) lifecycle rules with the expire action replaced by "stage for review" — exactly your review-before-delete constraint; (4) legal-hold/WORM inverted into a pin flag for canonical docs (git = the free compliance/recovery backstop); (5) Microsoft Purview disposition review as the near-exact blueprint for the reviewable trash can. Verified caveat: Purview offers an optional auto-approval timeout — we deliberately choose the stricter "human approves every purge" setting. Skip DoD 5015.02 / ISO 15489 / Glacier / SEC-WORM as overkill for 8 agents.

② RAG / vector-DB index hygiene & relevance decay

Re-centered on your real pain: this is supersession/canonical-version, not age. A month-old doc isn't "old." Critical finding: the off-the-shelf recency rerankers (LangChain TimeWeightedVectorStoreRetriever, LlamaIndex TimeWeightedPostprocessor) decay on last-accessed time and every retrieval refreshes it — so a stale-but-popular deck stays permanently "fresh" and the wrong answer gets entrenched. Decay must run on created_at/effective-date. Build the trash can from verified Chroma mechanics (in-place metadata update, where-filter exclusion). Add a staleness regression test — standard RAG eval (RAGAS) rates a stale-but-faithful answer 0.95 and won't catch this. Right-size: brute-force dedup beats LSH at 6,900 docs.
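The difference between the two decay clocks is easy to show. In the sketch below, freshness depends only on created_at and the content type, so retrieving a document cannot make it look newer. The content-type names and half-life values are hypothetical placeholders, not recommendations.

```python
from datetime import date

# Hypothetical half-lives in days; None means the type never decays.
HALF_LIFE_DAYS = {"event": 30, "cycle_goals": 90, "core_fact": None}

def freshness(created_at, content_type, today):
    """Exponential decay on document age: 1.0 when new, 0.5 after one half-life.

    There is no last-accessed input, so popularity cannot refresh a stale doc.
    A missing date or an unknown type degrades to neutral (1.0).
    """
    half_life = HALF_LIFE_DAYS.get(content_type)
    if created_at is None or half_life is None:
        return 1.0
    age_days = max((today - created_at).days, 0)
    return 0.5 ** (age_days / half_life)

today = date(2026, 6, 11)
event_score = freshness(date(2026, 5, 12), "event", today)     # exactly one half-life old
fact_score = freshness(date(2020, 1, 1), "core_fact", today)   # never decays
```

Decay of this kind addresses expiring content such as event docs. It does not by itself resolve the three-decks case, which is a supersession question rather than an age question.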

③ Soft-delete, archival & review-before-purge patterns

Separates cleanly into detection (what's dead — your "by signals") vs staging (reviewable trash + reversibility). Every enterprise system researched (Google Vault, M365 retention, Purview, S3 versioning, Notion/Confluence archive) is staging machinery and is time/age-triggered — a retention clock would expire all three decks by elapsed time and never notice deck #3 supersedes #1 and #2. So borrow the safe staging/review/audit lifecycle; build the supersession detector ourselves. Your substrate gives two of three staging tiers free: git (tombstone + audit + undo) and a Chroma "archived" flag (out of active retrieval, recoverable).

④ Automated staleness / obsolescence detection signals

Confidence comes from fusing multiple weak signals + a confidence "gray zone" routed to human review (the Dosu / temporal-RAG / fraud-screening pattern). Verified reframe of the three decks: embedding-cluster the decks as one family, pick newest by version/date as canonical, stage the older two. False-positive control is the deliverable (flagging something still-needed is the failure you fear most): per-content-type TTLs, a multi-signal threshold, a gray-zone review queue, and a per-doc "keep-forever" override. Verifier corrections folded in: retrieval-hit-rate downgraded to weak-corroborator at this scale; "N≥2 signals" must be genuinely orthogonal, not two correlated views of "oldness."
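The three-decks reframe can be sketched end to end with brute-force cosine clustering, which is adequate at this corpus size. The 0.9 threshold, the toy vectors, the document shape, and the function names are assumptions. The output is a proposal for the review queue, and a family with any missing date is routed to review instead of being staged.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def plan_families(docs, threshold=0.9):
    """Cluster docs into families (single-link, brute force) and propose a canonical."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(docs[i]["vec"], docs[j]["vec"]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, doc in enumerate(docs):
        groups.setdefault(find(i), []).append(doc)

    plans = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a doc with no near-duplicates is not a family
        ids = sorted(m["id"] for m in members)
        if any(not m.get("created_at") for m in members):
            plans.append({"canonical": None, "stage": [], "review": ids})
            continue
        newest = max(members, key=lambda m: m["created_at"])  # ISO dates sort as text
        stage = [i for i in ids if i != newest["id"]]
        plans.append({"canonical": newest["id"], "stage": stage, "review": []})
    return plans

plans = plan_families([
    {"id": "deck-apr", "created_at": "2026-04-01", "vec": [1.0, 0.02, 0.0]},
    {"id": "deck-may", "created_at": "2026-05-01", "vec": [1.0, 0.01, 0.0]},
    {"id": "deck-jun", "created_at": "2026-06-01", "vec": [1.0, 0.0, 0.0]},
    {"id": "hr-policy", "created_at": "2025-01-01", "vec": [0.0, 1.0, 0.0]},
])
```

Because two different clients' decks can land in the same cluster, a plan like this should be confirmed by a deterministic signal (version or date within the same family) before anything is staged.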

8. Sources

Lifecycle / tiering / records:
splunk.com/.../ilm-information-lifecycle-management · docs.aws.amazon.com S3 storage-class-intro / object-lifecycle-mgmt / object-lock-overview / tagging-best-practices · learn.microsoft.com Azure lifecycle-management + access-tiers · cloud.google.com storage-classes + lifecycle · learn.microsoft.com/purview/disposition · esd.whs.mil DoD 5015.02 · encompaas.cloud + zasio.com defensible-disposition

RAG hygiene / decay:
arxiv.org/abs/2510.08109 (Uber Eats semantic search) · milvus.io exponential-decay · python.langchain.com time_weighted_vectorstore + github issue 29306 · docs.llamaindex.ai TimeWeightedPostprocessor · docs.trychroma.com collections/update + filtering · pinecone.io update-delete-by-metadata · arxiv.org/pdf/2503.04800 (HoH benchmark) · medium.com embedding-model-upgrade-cost · atlan.com evaluate-rag · towardsdatascience.com rag-is-blind-to-time

Soft-delete / review:
jamestharpe.com/tombstone-pattern · streamkap.com cdc-soft-deletes · docs.aws.amazon.com DeleteMarker · support.google.com/vault · learn.microsoft.com/purview retention + disposition + preservation-lock · notion.com duplicate-delete-and-restore + custom-data-retention · github.com/chroma-core/chroma issue 3793 · git-tower.com restoring-deleted-files

Staleness detection:
github.com/Emmimal/temporal-rag · dosu.dev score-documentation-freshness · fiberplane.com drift-documentation-linter · arxiv.org 2511.12979 (RAGPulse) / 2502.15734 (Cache-Craft) / 2503.04800 (HoH) / 2509.19376 · sardine.ai + flagright.com false-positive control · docs.ragie.ai recency-bias

Full URLs with per-claim attribution are in the working file workspace/research/research-basis.md. Several sources were down-weighted or corrected during adversarial verification (e.g., a refuted "5%/60% hot-chunk" statistic, a dead LangChain v0.2 deep-link, Uber cadence re-sourced to primary) — those corrections are recorded.

9. Method & cross-check trail

Honesty notes: this is research/design only — nothing was changed. The deck bug is verified; the broader trash-can system is designed but recommended for phase 2. Where the original research over-reached (hyperscale machinery, correlated "independent" signals, an assumed metadata-update primitive that doesn't exist), the verification + critique caught and corrected it, and those corrections are reflected here rather than hidden.

Research + design proposal. No code, config, or data was modified. Built 2026-06-11 from an 11-agent research workflow, live-code verification, and 2 Codex cross-check passes. Working artifacts: workspace/research/ (final-design.json, research-basis.md, findings-digest.md).