Data Lifecycle & Stale-Content System

Research + design proposal · cortextOS / SiteSmith · 2026-06-11 · research only — nothing implemented; every change below awaits your decision · built from 11 research agents + 2 Codex cross-checks + live-code verification

The one-line answer

The "three sales decks, got the month-old one" problem is not a storage or "too much old data" problem — it's a retrieval bug: the system ranks by topic-similarity only, has no idea which version is current, and a hidden de-duplicator can silently drop the newest copy. I verified all of this in the live code. The "trash can for dead data" you described is a real and worthwhile second layer — fully designed below — but it should come after the ranking fix, because on an 8-agent system an auto-delete engine creates more review work than it saves.

In plain English

Why it grabbed the old deck: when you ask for "the most recent deck," the system finds the chunks that sound most like a deck and returns the top match. "Most recent" is never actually computed — there's no recency or "which one is current" logic anywhere in the ranking. Worse, a de-duplicator collapses near-identical decks and keeps the one that scored highest on wording, not the newest — so the current deck can be thrown away before you ever see it.
The real fix is small: teach retrieval to (a) group results by document, (b) know each doc's true date and which version is canonical, and (c) when you ask for "latest/current," rank by that and show its work. It is a targeted change in one file plus its callers, not a new storage system.
The "trash can" you want is the bigger system — confidently flag dead/duplicate content by multiple cross-referenced signals (never a guess), park it in a reviewable queue with the evidence, and delete nothing until you approve. It's designed in full below. Recommendation: build it after the ranking fix proves insufficient, so we don't manufacture review burden to solve a ranking miss.
The one thing that can make this confidently wrong: dates. If a doc's "created" date is missing or actually an edit/import date, the system will surface the wrong "latest" with more confidence than today. Date provenance is the gate on the whole thing.

Contents

1. What you asked vs. what we found
2. Root cause (verified in code)
3. The fix — phased
4. The reviewable "trash can" (designed, deferred)
5. Confident dead-content detection
6. Risks + decisions you need to make
7. How top orgs handle this
8. Sources
9. Method & cross-check trail

1. What you asked vs. what we found

You framed this as a data-hoarding / lifecycle problem: too much stale stuff, need a relevance-decay upgrade and a review-to-delete "trash can." We researched exactly that (how data centers, RAG systems, and records-management products do it — section 7). But when the research agents went into the actual cortextOS retrieval code to ground the design, they found the headline symptom (the deck miss) is caused upstream of any lifecycle system:

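A month-old doc is not "old" under any sane recency decay: with an exponential half-life of 14–30 days it still keeps roughly 23–50% of its recency weight, and that weight is only one small term in the score. So time-decay alone would not have fixed the deck miss.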
The miss is a versioning / canonical-resolution failure plus a ranking that ignores recency entirely — and a de-duplicator that can drop the newest copy.

So the work splits cleanly into two layers: (A) fix retrieval so "latest/current X" is correct and the newest version isn't silently dropped — small, high-leverage, fixes your actual complaint; and (B) the broader stale-content lifecycle + reviewable trash can — valuable, fully designed, but phased after (A).

2. Root cause — three recency-blind failure modes verified in live code

The research agents read the live retrieval path (knowledge-base/scripts/mmrag.py + its three callers) and reproduced the failure. I independently re-verified every code claim below against the actual file — line numbers and behavior confirmed.

① Ranking is similarity-only — no recency, no "which is current"

cmd_query (mmrag.py:1153) orders results purely by cosine similarity — confirmed by reading the code: there is no recency or canonical term anywhere in the ranking (a live query came back ordered by descending similarity, best matches in the ~0.66–0.68 range). "Most recent" is never computed.

② The de-duplicator keeps the highest-similarity copy, not the newest — and can drop the others before you see them

deduplicate_results (mmrag.py:1128, >0.85 word-overlap on the first 500 chars) collapses near-duplicate results and keeps results[0] = the highest-semantic member, and it runs before the top-K trim. In a controlled test, three boilerplate-heavy "decks" (overlap 1.0) collapsed to a single survivor chosen by wording score — the newest was discarded inside Python before the agent ever saw it. (Note: it keeps the highest-semantic, which is recency-blind and can drop the newest — not literally "always the oldest." Also: 0.85 overlap on the first 500 chars can collapse monthly decks that differ only in later slides.)
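The collapse described above can be reproduced with a small stand-in. This is a sketch, not the code in mmrag.py: the overlap metric (Jaccard over whitespace tokens), the result fields, the sample decks, and the names `word_overlap` and `dedup_keep_first` are assumptions. Only the 0.85 threshold, the 500-character window, and the keep-the-first-result behavior come from the findings above.

```python
def word_overlap(a: str, b: str, window: int = 500) -> float:
    """Jaccard overlap of the word sets in the first `window` characters (assumed metric)."""
    wa, wb = set(a[:window].lower().split()), set(b[:window].lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def dedup_keep_first(results: list[dict], threshold: float = 0.85) -> list[dict]:
    """Keep a result only if it does not overlap an already-kept one.

    `results` must already be sorted best-first, so the survivor of each
    near-duplicate cluster is whichever member ranked highest on similarity.
    """
    kept: list[dict] = []
    for r in results:
        if all(word_overlap(r["text"], k["text"]) <= threshold for k in kept):
            kept.append(r)
    return kept


# Hypothetical decks: identical opening boilerplate, different closing slides.
boilerplate = "Acme sales deck: pricing, roadmap, customer logos, contact details. " * 10
decks = [
    {"source": "deck-april.md", "similarity": 0.68, "created_at": "2026-04-10",
     "text": boilerplate + "April numbers"},
    {"source": "deck-may.md", "similarity": 0.67, "created_at": "2026-05-10",
     "text": boilerplate + "May numbers"},
    {"source": "deck-june.md", "similarity": 0.66, "created_at": "2026-06-10",
     "text": boilerplate + "June numbers"},
]

ranked = sorted(decks, key=lambda r: r["similarity"], reverse=True)
survivors = dedup_keep_first(ranked)
print([s["source"] for s in survivors])  # the June deck is not among them
```

Because the first 500 characters are identical, the only survivor is the deck with the highest similarity score, and `created_at` is never consulted.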

③ A 0.5 similarity cutoff runs before any re-ranking

All three callers apply a 0.5 similarity threshold before any rerank could recover a result. The live best legitimate match scored only 0.66–0.68 — little headroom — so a correct-but-slightly-off-topic newer version can be filtered out entirely first.

Three more verified facts that shape the fix

The store is chunk-level, not document-level. Each doc explodes into N rows (file_id = md5(path)[:12] + _chunk{i}, mmrag.py:489). "Most recent deck" is a document-level question — so ranking must group chunks → source before applying recency, or one strong old chunk beats the right new document. (Codex flagged this as the actual crux.)
The only timestamp present is poisoned. ingested_at exists on every chunk but it's re-ingest time — re-ingesting a month-old deck yesterday stamps it "fresh." Decay must use created_at from frontmatter or git, never ingest-time (this is the same trap as LangChain/LlamaIndex's access-time decay rerankers).
There is no metadata-update primitive. mmrag.py subcommands are exactly ingest/query/status/list/collections/delete/reset/usage — no update / set-metadata command. Tagging docs as canonical, or soft-deleting them, both require building this first.
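To make the chunk-versus-document point concrete, here is a minimal sketch of grouping chunk hits back to their source. The id shape (md5(path)[:12] + _chunk{i}) is taken from the finding above. The UTF-8 encoding of the path, the hit fields, the sample paths, and the mean-of-top-k aggregation are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def make_file_id(path: str, chunk_index: int) -> str:
    """Chunk id in the shape described above: md5(path)[:12] + _chunk{i}."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()[:12] + f"_chunk{chunk_index}"


def source_key(file_id: str) -> str:
    """Strip the _chunk{i} suffix so all chunks of one document share a key."""
    return file_id.rsplit("_chunk", 1)[0]


def group_by_source(hits: list[dict], top_k: int = 3) -> dict[str, float]:
    """Score each source by the mean similarity of its best `top_k` chunks."""
    by_source: dict[str, list[float]] = defaultdict(list)
    for h in hits:
        by_source[source_key(h["file_id"])].append(h["similarity"])
    return {
        src: sum(sorted(sims, reverse=True)[:top_k]) / min(len(sims), top_k)
        for src, sims in by_source.items()
    }


# Hypothetical hits: a long doc with six mediocre chunks, a short doc with two strong ones.
hits = [
    {"file_id": make_file_id("decks/long-overview.md", i), "similarity": 0.55}
    for i in range(6)
] + [
    {"file_id": make_file_id("decks/q2-deck.md", 0), "similarity": 0.66},
    {"file_id": make_file_id("decks/q2-deck.md", 1), "similarity": 0.64},
]

scores = group_by_source(hits)
best = max(scores, key=scores.get)
print(best == source_key(make_file_id("decks/q2-deck.md", 0)))
```

Averaging only the top chunks keeps the long document from winning on volume. Summing every chunk would hand it the top spot despite its weaker matches.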

3. The fix — phased

Codex split this into the minimal core that fixes your complaint vs. the operational layer that's nice-to-have. Reflected below.

V1 (core): Make "latest/current X" a ranking outcome, not an accident

Group chunks by source, then rank. Aggregate a source's score from its top chunks, apply recency/canonical at the document level, then return the best chunks from the winning source. (Crux per both Codex passes. Failure mode to guard: don't let long docs win via many mediocre chunks — aggregate from top-k chunks, not all.)
Insert a blended re-rank in cmd_query between the threshold filter and the de-duplicator: blended = α·similarity + β·recency_decay(created_at) + γ·canonical_boost → re-sort → then dedup (now keeps the freshest/canonical as results[0]) → then trim. This is the only location that works — the TypeScript layer only ever sees the already-trimmed five.
Gate recency on query intent. Only boost recency strongly for "latest / most recent / current / newest" queries; keep similarity dominant otherwise, so normal factual queries don't regress. (Codex: prevents "recency swamps relevance.")
Fix the threshold sequencing. Rerank-before-dedup does not fix mode ③ on its own — a sub-0.5 newer doc is dead before rerank. Use a low raw candidate floor + a final blended-score threshold (or exempt canonical/version docs), and over-fetch more candidates.
Ship "explain" output now: raw similarity, recency component, canonical flag, final score, and which duplicates were suppressed — so the next miss is diagnosable in seconds, not hours.
Mirror the re-sort in the two TS cross-collection mergers (they currently concat unsorted / sort by similarity only) so agent-facing and dashboard answers agree.
Pin evergreen docs (MEMORY.md, GOALS.md, GUARDRAILS.md, the one active deck) so recency/staleness logic can never down-rank or stage them.
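The blended re-rank, intent gating, and explain output from the steps above might look like the following. It is a sketch under stated assumptions, not the change to cmd_query itself: the weights, the 30-day half-life, the neutral value for missing dates, the intent regex, and every name here (`recency_decay`, `blended_rerank`, the `explain` keys) are hypothetical. Inputs are assumed to be document-level results, already grouped by source.

```python
import re
from datetime import date
from typing import Optional

LATEST_INTENT = re.compile(r"\b(latest|most recent|current|newest)\b", re.IGNORECASE)


def recency_decay(created_at: Optional[date], today: date,
                  half_life_days: float = 30.0) -> Optional[float]:
    """Exponential half-life decay in (0, 1]; None when the date is unknown."""
    if created_at is None:
        return None
    age_days = max((today - created_at).days, 0)
    return 0.5 ** (age_days / half_life_days)


def blended_rerank(query: str, docs: list[dict], today: date) -> list[dict]:
    """Sort docs by alpha*similarity + beta*recency + gamma*canonical, with explain fields."""
    wants_latest = bool(LATEST_INTENT.search(query))
    # Assumed weights: recency and canonical only carry real weight on "latest"-style queries.
    alpha, beta, gamma = (0.5, 0.3, 0.2) if wants_latest else (0.9, 0.05, 0.05)
    neutral = 0.5  # missing dates degrade to neutral; ingest time is never used
    scored = []
    for d in docs:
        decay = recency_decay(d.get("created_at"), today)
        recency = neutral if decay is None else decay
        canonical = 1.0 if d.get("is_canonical") else 0.0
        final = alpha * d["similarity"] + beta * recency + gamma * canonical
        scored.append({**d, "explain": {
            "similarity": d["similarity"],
            "recency": round(recency, 4),
            "date_missing": decay is None,
            "canonical": bool(d.get("is_canonical")),
            "final": round(final, 4),
        }})
    return sorted(scored, key=lambda d: d["explain"]["final"], reverse=True)


# Hypothetical document-level results for the three decks.
decks = [
    {"source": "deck-april.md", "similarity": 0.68, "created_at": date(2026, 4, 10)},
    {"source": "deck-may.md", "similarity": 0.67, "created_at": date(2026, 5, 10)},
    {"source": "deck-june.md", "similarity": 0.66, "created_at": date(2026, 6, 10)},
]
ranked = blended_rerank("most recent sales deck", decks, today=date(2026, 6, 11))
for r in ranked:
    print(r["source"], r["explain"])
```

De-duplication and the top-K trim would run after this sort, so the survivor of each near-duplicate cluster is the freshest or canonical member rather than the best-worded one.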

Step 0 before any of this: re-run the deck query against your real corpus at high top-K and threshold 0 to confirm which mode fires for your docs, and commit that as the regression baseline.

V1.5 — prerequisites & one-time migration

Build a set-meta-by-source primitive (doesn't exist): update every chunk of a source's metadata in place, with a dry-run + exact chunk-count audit (partial metadata = worse ranking). Prerequisite for both canonical tagging and any future soft-delete.
Tiny metadata schema: doc_family, version, status, is_canonical, current, supersedes/superseded_by, created_at, effective_date, pinned. Note canonical ≠ current (an evergreen canonical doc is not the same as "newest deck"). Keep the vocabulary small — no enterprise PHI/PCI taxonomies.
Backfill + run a canonical resolver once over the ~6,900 docs: group by source → group by doc_family → order by version/date → mark the winner is_canonical, predecessors superseded. Route ambiguous families to review, never auto-supersede.
Generalize the existing where-filter (already implemented for --type) so callers can pass metadata filters — smaller build than "from scratch."
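The one-time canonical resolver could be sketched as below. The field names follow the schema listed above. The ordering rule (ISO date string only, no version parsing), the ambiguity rule (a missing date or a tie for newest), the sample families, and the function name `resolve_canonical` are assumptions.

```python
from collections import defaultdict


def resolve_canonical(docs: list[dict]) -> tuple[list[dict], list[str]]:
    """Mark one canonical doc per doc_family; return (updated docs, families needing review)."""
    families: dict[str, list[dict]] = defaultdict(list)
    for d in docs:
        families[d["doc_family"]].append(dict(d))  # copy so the input is not mutated

    updated: list[dict] = []
    needs_review: list[str] = []
    for family, members in families.items():
        dates = [m.get("created_at") for m in members]
        newest = max((x for x in dates if x is not None), default=None)
        # Ambiguous family: a member has no date, or two members tie for newest.
        if None in dates or dates.count(newest) > 1:
            needs_review.append(family)
            updated.extend(members)  # left untouched, never auto-superseded
            continue
        winner = next(m for m in members if m["created_at"] == newest)
        for m in members:
            m["is_canonical"] = m is winner
            m["status"] = "active" if m is winner else "superseded"
            if m is not winner:
                m["superseded_by"] = winner["source"]
        updated.extend(members)
    return updated, needs_review


# Hypothetical corpus slice: one clean family, one with a missing date.
docs = [
    {"source": "deck-v1.md", "doc_family": "sales-deck", "created_at": "2026-04-10"},
    {"source": "deck-v2.md", "doc_family": "sales-deck", "created_at": "2026-05-10"},
    {"source": "deck-v3.md", "doc_family": "sales-deck", "created_at": "2026-06-10"},
    {"source": "pricing-a.md", "doc_family": "pricing", "created_at": "2026-05-01"},
    {"source": "pricing-b.md", "doc_family": "pricing", "created_at": None},
]
updated, needs_review = resolve_canonical(docs)
print(needs_review)
```

The resolver only computes the proposed metadata. Writing it back to every chunk would still depend on the set-meta-by-source primitive, with its dry run and chunk-count audit.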

V2 (deferred): The automated staleness detector + reviewable trash can (section 4)

Designed in full below, but do not pre-build it. On a low-query 8-agent brain where git already gives free, complete undo, an automated delete-flagging engine manufactures review burden and a delete-risk surface to solve what is, today, a ranking miss. Build it only when real staleness materializes beyond what V1 fixes.

4. The reviewable "trash can" — designed, recommended for phase 2

This is the system you described. The research is unanimous on the shape, and your substrate gives most of it for free:

detector flags a candidate (WITH evidence, multi-signal)
        |
        v
SOFT-DELETE: set status=staged_for_review on every chunk (via set-meta)
        |  -> excluded from active retrieval by a where-filter
        |  -> the .md file stays on disk + in git, untouched
        v
REVIEW MANIFEST (git-tracked): each item + its evidence + proposed replacement
        |
        v
YOU review (one queue, one reviewer), MULTI-ACTION per item:
    Approve-purge / Keep / Archive / Supersede-link / Extend
        |
        v
only an Approve-purge -> deliberate git rm in one reviewed commit
    (still recoverable from git history; no timer ever deletes)

git is your audit trail + undo. A purge is a recoverable commit; git rm'd content lives in history (restore it from the commit before the deletion, e.g. git checkout <commit>^ -- <path>). No new storage infra.
"Archived" = a Chroma metadata flag excluded from default queries — removed from active retrieval but fully recoverable. The file never leaves disk.
Multi-action review, not binary delete — copied from Microsoft Purview's disposition review (the closest enterprise analog), collapsed to one reviewer/one stage for your scale.
Pin / hold primitive so canonical/evergreen docs can never be staged; the gate can only be widened, never bypassed.
No auto-approval timer — explicitly the stricter setting (enterprise tools offer auto-disposal after N days; we don't enable it). Nothing is ever deleted without you.
Substrate gotchas to honor: Chroma and the .md/git store are two separate stores — re-sync after any archive/purge or they drift; and Chroma's own delete can strand data (needs an explicit rebuild to reclaim space) — re-confirm against the deployed Chroma version.
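The staging and review rules above can be expressed as a tiny state machine. This is a sketch only: the status names other than staged_for_review, the mapping from each review action to a resulting status (in particular what "Extend" does), and the function names are assumptions, and no git or Chroma calls are made.

```python
# Assumed mapping from the five review actions to the status they leave behind.
ACTION_TO_STATUS = {
    "approve_purge": "purge_approved",   # the only outcome that leads to a git rm
    "keep": "active",
    "archive": "archived",
    "supersede_link": "superseded",
    "extend": "staged_for_review",       # assumed meaning: stay staged, review again later
}


def stage(doc: dict, evidence: list[str]) -> dict:
    """Soft-delete: flag a doc for review. Pinned docs and evidence-free flags are refused."""
    if doc.get("pinned"):
        raise ValueError(f"{doc['source']} is pinned and cannot be staged")
    if not evidence:
        raise ValueError("staging requires at least one piece of evidence")
    return {**doc, "status": "staged_for_review", "evidence": list(evidence)}


def review(doc: dict, action: str) -> dict:
    """Apply one reviewer action to a staged doc; nothing changes without an explicit action."""
    if doc.get("status") != "staged_for_review":
        raise ValueError("only staged docs can be reviewed")
    if action not in ACTION_TO_STATUS:
        raise ValueError(f"unknown action: {action}")
    return {**doc, "status": ACTION_TO_STATUS[action]}


# Hypothetical usage: stage a superseded deck, then keep it after review.
staged = stage({"source": "deck-april.md", "status": "active"},
               ["superseded by deck-june.md"])
kept = review(staged, "keep")
print(staged["status"], "->", kept["status"])
```

There is deliberately no function that moves a doc to purge_approved on a timer. The only path there is an explicit `review(..., "approve_purge")` call.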

5. Confidently identifying dead content — "by signals, not a guess"

This is your hard requirement, and the literature agrees on the method: fuse several weak, independent signals into one score; flag only when they agree; route everything ambiguous to human review; never auto-delete.

Signal	Strength & honest caveat
Deterministic supersession (newer version / later effective-date / git timestamp / filename pattern like `-v3`, `-2026-06`)	The anchor. Objective, primary-source. Strong enough to stage on its own. This alone solves the three-decks case.
Near-duplicate embedding cluster (cosine on the vectors already in Chroma)	Confirms supersession; never used alone (two clients' decks cluster but are distinct). At 6,900 docs brute-force is trivial — skip MinHash/LSH.
Cross-reference orphan (nothing links to it)	Conditional: only meaningful if brain files have machine-parseable links. If they're freeform prose, this is unreliable — must validate first.
Per-content-type recency decay (event docs expire; core facts never)	On `created_at`/effective-date, never ingest-time. An old "this cycle's goals" is legitimately expired → flag-for-archive, not delete.
Retrieval hit-rate (never returned/used)	Weak here. On a low-traffic 8-agent system, zero hits is mostly low query volume, not death. Long window, corroborator only, never alone.

The "not a guess" bar, corrected: require agreement of genuinely orthogonal signals. The tempting pair "superseded AND orphaned" is not orthogonal — superseding a doc causes its orphaning (references get re-pointed), so they're correlated by common cause. Anchor on deterministic supersession; fuzzy/embedding-only cases need a real confirmer or go to review.
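One way to encode the "anchor plus orthogonal confirmers" rule is a small decision function. This is a sketch, not a tuned detector: the signal names, the choice to treat near-duplicate clustering and per-type expiry as the independent pair, and the three outcomes (stage, review, keep) are assumptions layered on the table above.

```python
def classify(signals: dict) -> str:
    """Return 'stage', 'review', or 'keep' from a dict of boolean signals."""
    if signals.get("pinned"):
        return "keep"  # pinned docs are never staged
    if signals.get("deterministic_supersession"):
        return "stage"  # the anchor is strong enough on its own
    # Assumed to be independent of each other: content similarity vs. per-type expiry.
    independent = sum(
        bool(signals.get(name)) for name in ("near_duplicate_cluster", "type_expired")
    )
    if independent >= 2:
        return "stage"
    # Orphaning and zero retrieval hits never count alone; they only
    # push an otherwise unflagged doc into the gray zone for a human.
    if independent == 1 or signals.get("orphaned") or signals.get("never_retrieved"):
        return "review"
    return "keep"


print(classify({"deterministic_supersession": True}))            # stage
print(classify({"near_duplicate_cluster": True}))                # review
print(classify({"orphaned": True, "never_retrieved": True}))     # review
print(classify({"pinned": True, "deterministic_supersession": True}))  # keep
```

Both "stage" and "review" end in front of a human. The difference is only whether the item enters the queue with a confident recommendation or as a gray-zone question.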

6. Risks & the decisions you need to make

Top risk (gates everything)

Date provenance. If a doc's created_at is missing, copied, rebased, or actually edit/import time rather than the deck's real date, V1 returns the wrong "latest" with more confidence than today. Approve V1 only if date provenance is auditable and missing dates are visible in the explain output (degrade to neutral, never fall back to ingest-time).
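A minimal sketch of auditable date provenance follows. The metadata keys other than effective_date, the preference order, and the function name `resolve_created_at` are assumptions. The one fixed rule from the text above is that ingested_at is never used as a fallback.

```python
from datetime import date
from typing import Optional


def resolve_created_at(meta: dict) -> tuple[Optional[date], str]:
    """Pick a document date and record which field it came from.

    Returns (None, "missing") when no trustworthy date exists, so the ranker
    can degrade to neutral and the explain output can show the gap.
    """
    # Assumed preference order; ingested_at is deliberately absent from this list.
    for field in ("effective_date", "frontmatter_created_at", "git_first_commit_date"):
        raw = meta.get(field)
        if raw:
            try:
                return date.fromisoformat(raw), field
            except ValueError:
                continue  # unparseable value: try the next source
    return None, "missing"


# Hypothetical chunk metadata.
print(resolve_created_at({"frontmatter_created_at": "2026-06-10",
                          "ingested_at": "2026-06-11"}))
print(resolve_created_at({"ingested_at": "2026-06-11"}))
```

Returning the provenance label alongside the date is what makes a wrong "latest" diagnosable: the explain output can show not just the date used but where it came from.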

Other risks

Recency swamps relevance on normal queries → mitigated by query-intent gating + small β/γ.
Canonical pin becomes stale authority (an old "canonical" deck beats the newest) → canonical ≠ current; resolver must update it.
Dedup hides meaningful versions (0.85 overlap collapses decks differing only in later slides) → make dedup source/family-aware.
Two stores drift (Chroma vs .md/git) → re-sync after any change. The embedding model is a preview version (gemini-embedding-2-preview) that will be deprecated — plan a full one-shot re-embed as a "when," not "if."

Decisions for you (full list in the appendix)

Should doc metadata live authoritatively in .md frontmatter (git-diffable, survives re-ingest) and mirror into Chroma, or only in Chroma?
Is mmrag.py owned locally (safe to extend) or pulled from upstream cortextOS (then changes go through the upstream-delivery path)?
De-duplicator: minimal (rerank-before-dedup) or the more-correct source/family-aware version that preserves "show me all our decks"?
What ChromaDB version is deployed? (Determines whether in-place metadata update and space-reclaiming delete work as assumed.)

7. How top orgs handle this — and what actually transfers

Four research angles, each web-sourced and adversarially verified. The honest throughline: most hyperscale machinery is a cost tradeoff that doesn't exist on one local box — but the lifecycle and review patterns transfer directly.

① Enterprise data lifecycle & tiered storage (AWS/Azure/GCP, Purview, records management)

Hot/warm/cold/archive tiering (S3 Glacier, Azure tiers) is fundamentally about saving money on rarely-touched bytes across storage with different latencies — none of that economics exists locally, where disk + Chroma are uniformly fast. What transfers, recast for relevance not cost: (1) the lifecycle state machine (active → stale-candidate → staged → deleted); (2) classification/tagging — the direct fix for "which deck is current"; (3) lifecycle rules with the expire action replaced by "stage for review" — exactly your review-before-delete constraint; (4) legal-hold/WORM inverted into a pin flag for canonical docs (git = the free compliance/recovery backstop); (5) Microsoft Purview disposition review as the near-exact blueprint for the reviewable trash can. Verified caveat: Purview offers an optional auto-approval timeout — we deliberately choose the stricter "human approves every purge" setting. Skip DoD 5015.02 / ISO 15489 / Glacier / SEC-WORM as overkill for 8 agents.

② RAG / vector-DB index hygiene & relevance decay

Re-centered on your real pain: this is supersession/canonical-version, not age. A month-old doc isn't "old." Critical finding: the off-the-shelf recency rerankers (LangChain TimeWeightedVectorStoreRetriever, LlamaIndex TimeWeightedPostprocessor) decay on last-accessed time and every retrieval refreshes it — so a stale-but-popular deck stays permanently "fresh" and the wrong answer gets entrenched. Decay must run on created_at/effective-date. Build the trash can from verified Chroma mechanics (in-place metadata update, where-filter exclusion). Add a staleness regression test — standard RAG eval (RAGAS) rates a stale-but-faithful answer 0.95 and won't catch this. Right-size: brute-force dedup beats LSH at 6,900 docs.

③ Soft-delete, archival & review-before-purge patterns

Separates cleanly into detection (what's dead — your "by signals") vs staging (reviewable trash + reversibility). Every enterprise system researched (Google Vault, M365 retention, Purview, S3 versioning, Notion/Confluence archive) is staging machinery and is time/age-triggered — a retention clock would expire all three decks by elapsed time and never notice deck #3 supersedes #1 and #2. So borrow the safe staging/review/audit lifecycle; build the supersession detector ourselves. Your substrate gives two of three staging tiers free: git (tombstone + audit + undo) and a Chroma "archived" flag (out of active retrieval, recoverable).

④ Automated staleness / obsolescence detection signals

Confidence comes from fusing multiple weak signals + a confidence "gray zone" routed to human review (the Dosu / temporal-RAG / fraud-screening pattern). Verified reframe of the three decks: embedding-cluster the decks as one family, pick newest by version/date as canonical, stage the older two. False-positive control is the deliverable (flagging something still-needed is the failure you fear most): per-content-type TTLs, a multi-signal threshold, a gray-zone review queue, and a per-doc "keep-forever" override. Verifier corrections folded in: retrieval-hit-rate downgraded to weak-corroborator at this scale; "N≥2 signals" must be genuinely orthogonal, not two correlated views of "oldness."

8. Sources

Lifecycle / tiering / records:
splunk.com/.../ilm-information-lifecycle-management · docs.aws.amazon.com S3 storage-class-intro / object-lifecycle-mgmt / object-lock-overview / tagging-best-practices · learn.microsoft.com Azure lifecycle-management + access-tiers · cloud.google.com storage-classes + lifecycle · learn.microsoft.com/purview/disposition · esd.whs.mil DoD 5015.02 · encompaas.cloud + zasio.com defensible-disposition

RAG hygiene / decay:
arxiv.org/abs/2510.08109 (Uber Eats semantic search) · milvus.io exponential-decay · python.langchain.com time_weighted_vectorstore + github issue 29306 · docs.llamaindex.ai TimeWeightedPostprocessor · docs.trychroma.com collections/update + filtering · pinecone.io update-delete-by-metadata · arxiv.org/pdf/2503.04800 (HoH benchmark) · medium.com embedding-model-upgrade-cost · atlan.com evaluate-rag · towardsdatascience.com rag-is-blind-to-time

Soft-delete / review:
jamestharpe.com/tombstone-pattern · streamkap.com cdc-soft-deletes · docs.aws.amazon.com DeleteMarker · support.google.com/vault · learn.microsoft.com/purview retention + disposition + preservation-lock · notion.com duplicate-delete-and-restore + custom-data-retention · github.com/chroma-core/chroma issue 3793 · git-tower.com restoring-deleted-files

Staleness detection:
github.com/Emmimal/temporal-rag · dosu.dev score-documentation-freshness · fiberplane.com drift-documentation-linter · arxiv.org 2511.12979 (RAGPulse) / 2502.15734 (Cache-Craft) / 2503.04800 (HoH) / 2509.19376 · sardine.ai + flagright.com false-positive control · docs.ragie.ai recency-bias

Full URLs with per-claim attribution are in the working file workspace/research/research-basis.md. Several sources were down-weighted or corrected during adversarial verification (e.g., a refuted "5%/60% hot-chunk" statistic, a dead LangChain v0.2 deep-link, Uber cadence re-sourced to primary) — those corrections are recorded.

9. Method & cross-check trail

11 agents across a deterministic workflow: 4 parallel deep-research angles (web-sourced) → adversarial verification of each → synthesis → a completeness-critic gap pass → a refined design addressing every gap.
Live-code verification (me): every load-bearing code claim (the de-duplicator at mmrag.py:1128 keeping results[0], the 8 subcommands with no update primitive, ingested_at as ingest-time, the preview embedding model) was re-checked against the actual source — all confirmed.
2 Codex cross-checks (independent reasoning model): caught a residual hole (the 0.5 threshold isn't fixed by rerank-before-dedup alone), established source-level grouping as the real crux, added query-intent gating, the minimal-core-vs-gold-plating split, and named date-provenance as the #1 risk. Both folded in above.
Convergence: the second Codex pass surfaced no new critical gaps — it reinforced the first. That's the stopping condition.

Honesty notes: this is research/design only — nothing was changed. The deck bug is verified; the broader trash-can system is designed but recommended for phase 2. Where the original research over-reached (hyperscale machinery, correlated "independent" signals, an assumed metadata-update primitive that doesn't exist), the verification + critique caught and corrected it, and those corrections are reflected here rather than hidden.

Research + design proposal. No code, config, or data was modified. Built 2026-06-11 from an 11-agent research workflow, live-code verification, and 2 Codex cross-check passes. Working artifacts: workspace/research/ (final-design.json, research-basis.md, findings-digest.md).