goon/app
jtrzupek 7e46e5ac48 feat(scheduler): deep-crawl full tube catalogs (Phase 2a — ingest-all)
Until now we had ingested only ~3% of each browse tube's catalog (porndoe lists >62k
scenes; we had 1959), because tubes were reached only via performer search and top-N
browse. A pilot over porndoe pages 64-110 yielded 1119 new scenes: 100% playable, 100%
tagged, 0% canonical overlap (purely additive: content not in TPDB/StashDB).

- app/scheduler/deep_crawl.py: round-robin over ALL_BROWSE_SCRAPERS, with a per-tube page
  cursor in app/_state/deepcrawl_state.json (no DB migration); deep-paginates from the
  cursor, is idempotent (the resolver skips scenes already known by raw_hash), marks a
  tube 'exhausted' at catalog end, then resets cursors for an incremental re-sweep.
- _job_deep_crawl: hourly, 60 pages/run (~1860 scenes, ~22 min), wrapped in the 1h
  hard-timeout; registered in build_scheduler (jobs=10).
- config: sched_deep_crawl_hours=1, deep_crawl_pages_per_run=60, deepcrawl_state_path.
- scripts/pilot_porndoe_deepcrawl.py: one-off pilot used to validate the approach.
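
A minimal sketch of the cursor-driven round-robin described in the list above, not the actual contents of app/scheduler/deep_crawl.py. The names deep_crawl, load_state, save_state, fetch_page and ingest are hypothetical, as is the state layout ({"page": ..., "exhausted": ...} per tube). It also assumes that an empty page marks the catalog end and that cursors are reset only once every tube is exhausted; the commit message does not spell out either detail.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical callback shapes (assumptions, not the project's real API):
#   fetch_page(tube, page) -> list of scene dicts; an empty list means "past catalog end"
#   ingest(scenes)         -> number of scenes that were actually new
FetchPage = Callable[[str, int], List[dict]]
Ingest = Callable[[List[dict]], int]


def load_state(path: Path) -> Dict[str, dict]:
    """Read the per-tube cursor file; a missing file means a fresh crawl."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def save_state(path: Path, state: Dict[str, dict]) -> None:
    """Write to a temp file and rename, so a crash cannot leave a half-written cursor."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def deep_crawl(tubes: List[str], fetch_page: FetchPage, ingest: Ingest,
               state_path: Path, pages_per_run: int = 60) -> int:
    """Spend up to pages_per_run page fetches, one page per tube per turn."""
    state = load_state(state_path)
    for tube in tubes:
        state.setdefault(tube, {"page": 1, "exhausted": False})

    # Assumed reset rule: once every tube reached its end, start a re-sweep.
    if all(state[t]["exhausted"] for t in tubes):
        for tube in tubes:
            state[tube] = {"page": 1, "exhausted": False}

    budget = pages_per_run
    new_scenes = 0
    while budget > 0:
        active = [t for t in tubes if not state[t]["exhausted"]]
        if not active:
            break
        for tube in active:
            if budget == 0:
                break
            scenes = fetch_page(tube, state[tube]["page"])
            budget -= 1
            if not scenes:
                state[tube]["exhausted"] = True
                continue
            new_scenes += ingest(scenes)   # ingest is expected to skip known raw_hash values
            state[tube]["page"] += 1       # advance the cursor only after a successful page
        save_state(state_path, state)      # persist after every round-robin pass
    return new_scenes
```

Because the cursor lives in a JSON file and ingest is expected to skip scenes it already knows, re-running the function after a crash or timeout should repeat at most one pass of pages without creating duplicates.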

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:26:44 +02:00
api perf(scenes): drop exact count on filtered lists; index scene_tags(tag_id) 2026-06-02 12:00:36 +02:00
connectors fix(movies): paradisehill delta date-granularity + browse cadence docs 2026-06-01 17:00:10 +02:00
extractors fix(pornhub): WebView fallback — yt-dlp gets 403 from VPS 2026-06-02 21:41:38 +02:00
models perf(taxonomy): denormalize scene_count for tags/performers/studios 2026-05-31 17:53:48 +02:00
normalize feat(ingest): SQL phash match, tag inference + backfill, clip-store skip, browse tubes, watchdog 2026-06-01 15:07:35 +02:00
resolve fix(scenes): propagate playback duration to Scene + duration-consistent counts 2026-06-01 21:31:01 +02:00
scheduler feat(scheduler): deep-crawl full tube catalogs (Phase 2a — ingest-all) 2026-06-03 09:26:44 +02:00
templates feat(seo): public HTML SEO router + templates; add CLAUDE.md; ignore .nimbalyst 2026-05-31 16:29:59 +02:00
__init__.py Initial commit 2026-05-20 10:10:22 +02:00
auth.py Initial commit 2026-05-20 10:10:22 +02:00
config.py feat(scheduler): deep-crawl full tube catalogs (Phase 2a — ingest-all) 2026-06-03 09:26:44 +02:00
db.py Initial commit 2026-05-20 10:10:22 +02:00
ingest.py feat(ingest): SQL phash match, tag inference + backfill, clip-store skip, browse tubes, watchdog 2026-06-01 15:07:35 +02:00
main.py fix(apk 0.2.1): in-app installer "nothing happens" + oo launcher icon 2026-05-31 13:15:37 +02:00