We ingested only ~3% of each browse tube's catalog (porndoe >62k scenes; we had 1959)
because tubes were hit only by performer-search + top-N browse. Pilot (porndoe pages
64-110): 1119 new scenes, 100% playable + 100% tagged, 0% canonical overlap (purely
additive — content not in TPDB/StashDB).
- app/scheduler/deep_crawl.py: round-robin over ALL_BROWSE_SCRAPERS, per-tube page cursor
in app/_state/deepcrawl_state.json (no DB migration), deep-paginate from the cursor,
idempotent (resolver skips known by raw_hash), mark 'exhausted' at catalog end then
reset cursors for an incremental re-sweep.
- _job_deep_crawl: hourly, 60 pages/run (~1860 scenes, ~22 min), wrapped in the 1h
hard-timeout; registered in build_scheduler (jobs=10).
- config: sched_deep_crawl_hours=1, deep_crawl_pages_per_run=60, deepcrawl_state_path.
- scripts/pilot_porndoe_deepcrawl.py: one-off pilot used to validate the approach.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolver/perf:
- find_by_phash_within: nearest match via Postgres bit_count over bit(64) XOR
instead of Python scan of all phash fingerprints (~20x faster per scene;
unblocks long delta runs that were killed mid-run before since advanced).
Scheduler/reliability:
- reap ingest_runs stuck in 'running' on worker startup (killed_by_restart).
- smoke_test: per-source ingest health, stuck-run and browse-freshness checks
-> Sentry; exclude killed_by_restart from the failed-run alarm.
Tags (ingest with tags + fill blanks):
- wire infer_tag_slugs into normalize_scene so tube scenes get title-inferred
tags (was dead code); union with connector tags.
- scripts/backfill_inferred_tags.py: keyset/batched/idempotent backfill for
existing tagless scenes (playable tag coverage 16% -> ~52%).
Clip-store:
- skip ManyVids/IWantClips/Clips4Sale/... from canonical sources at ingest
(GOON_SKIP_CLIP_STORE, default on) — permanent orphans, ~56% of canonical
ingest, never have a free-tube playback source.
Browse tubes:
- enable fullmovies + hdporn.gg: studio parsed from title prefix instead of
the /networks/ sidebar (which always yielded the first listed network);
drop phash compute (pilot: 0% canonical hit within Hamming 5 — auto-screenshots),
matching relies on title/performer/duration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Counts for /tags, /performers, /studios and /favorites were computed live
per-request by aggregating scene_tags / scene_performers with an EXISTS to
playback_sources. As the catalog grew to ~1.7M scenes (6.3M scene_tags) this
ran ~4.3s for /tags?order=popular (x2 incl. the total count) and ~950ms for
the default /scenes count, making those screens load in several seconds.
- migration 0019: add scene_count (+ DESC index) to tags/performers/studios
- background job _job_refresh_taxonomy_counts (every 3h) recomputes the counts
in one UPDATE..FROM each (IS DISTINCT FROM to skip unchanged rows)
- /tags, /performers, /studios scenes path now read the column + ORDER BY the
indexed scene_count; for_movies paths keep live aggregation (small tables)
- favorites read denormalized scene_count instead of a grouped EXISTS aggregate
- /scenes default count: 10-min in-process TTL cache (header is approximate)
Measured: /tags?order=popular&per_page=500 ~8s -> 66ms incl. serialization.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Goon — self-hosted aggregator for adult-content scene metadata.
Indexes scenes from TPDB, StashDB, and 30+ public adult tube sites.
Cross-source deduplication via perceptual hash + Levenshtein distance.
FastAPI backend + APScheduler worker + React Native (Expo) mobile client.
FOSS, ad-free, donation-funded. See README for details.