The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp", a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service, which it
is not: these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant. The DB row was renamed in place (UPDATE name, same id), so all ingest_runs and
external_records history stays linked. No behavior change: external_id keying
(sitetag:url) and dedup are unaffected.
NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.
Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".
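A minimal sketch of the in-place rename described above. Only SCRAPER_SOURCE_NAME and the "UPDATE name, same id" approach come from this change; the table layout (`sources`, `ingest_runs.source_id`), the use of SQLite, and the helper names are assumptions made for illustration.

```python
import sqlite3

SCRAPER_SOURCE_NAME = "tube-scraper"  # new umbrella name (from this commit)
LEGACY_SOURCE_NAME = "pornapp"        # old name being retired


def rename_source_in_place(conn: sqlite3.Connection) -> int:
    """Rename the legacy source row without touching its id.

    Returns the number of rows updated (0 if already renamed, else 1).
    """
    cur = conn.execute(
        "UPDATE sources SET name = ? WHERE name = ?",
        (SCRAPER_SOURCE_NAME, LEGACY_SOURCE_NAME),
    )
    conn.commit()
    return cur.rowcount


def demo() -> dict:
    """Build a hypothetical schema in memory and show history stays linked."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sources (id INTEGER PRIMARY KEY, name TEXT UNIQUE, kind TEXT);
        CREATE TABLE ingest_runs (
            id INTEGER PRIMARY KEY,
            source_id INTEGER REFERENCES sources(id)
        );
        INSERT INTO sources (id, name, kind) VALUES (7, 'pornapp', 'scraper');
        INSERT INTO ingest_runs (source_id) VALUES (7), (7);
        """
    )
    updated = rename_source_in_place(conn)
    source_id = conn.execute(
        "SELECT id FROM sources WHERE name = ?", (SCRAPER_SOURCE_NAME,)
    ).fetchone()[0]
    linked_runs = conn.execute(
        "SELECT COUNT(*) FROM ingest_runs WHERE source_id = ?", (source_id,)
    ).fetchone()[0]
    stray = conn.execute(
        "SELECT COUNT(*) FROM sources WHERE name = ?", (LEGACY_SOURCE_NAME,)
    ).fetchone()[0]
    conn.close()
    return {"updated": updated, "id": source_id, "linked_runs": linked_runs, "stray": stray}
```

Because the row keeps its id, nothing that references it by foreign key needs rewriting. Running the update a second time is a no-op, since no row matches the old name.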
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
We had ingested only ~3% of each browse tube's catalog (e.g. porndoe has >62k scenes; we had 1959)
because tubes were hit only by performer-search and top-N browse. Pilot (porndoe pages
64-110): 1119 new scenes, 100% playable and 100% tagged, 0% canonical overlap (purely
additive: content not in TPDB/StashDB).
- app/scheduler/deep_crawl.py: round-robin over ALL_BROWSE_SCRAPERS with a per-tube page cursor
in app/_state/deepcrawl_state.json (no DB migration); deep-paginates from the cursor,
is idempotent (resolver skips known scenes by raw_hash), and marks a tube 'exhausted' at
catalog end, then resets cursors for an incremental re-sweep.
- _job_deep_crawl: hourly, 60 pages/run (~1860 scenes, ~22 min), wrapped in the 1h
hard-timeout; registered in build_scheduler (jobs=10).
- config: sched_deep_crawl_hours=1, deep_crawl_pages_per_run=60, deepcrawl_state_path.
- scripts/pilot_porndoe_deepcrawl.py: one-off pilot used to validate the approach.
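The round-robin and cursor handling listed above can be sketched roughly as follows. This is an illustration under assumptions, not the code in app/scheduler/deep_crawl.py: the state layout, the `fetch_page` callable, the function names, and the choice to reset cursors only once every tube is exhausted are all hypothetical. Deduplication by raw_hash is left to the resolver and is not shown.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical state shape: {"<sitetag>": {"page": int, "exhausted": bool}}
State = Dict[str, dict]


def load_state(path: Path) -> State:
    """Read per-tube cursors from the JSON state file, or start empty."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_state(path: Path, state: State) -> None:
    """Persist per-tube cursors so the next run resumes where this one stopped."""
    path.write_text(json.dumps(state, indent=2, sort_keys=True))


def deep_crawl_run(
    tubes: List[str],
    fetch_page: Callable[[str, int], list],
    state: State,
    pages_per_run: int,
) -> int:
    """Fetch up to pages_per_run pages, one page per tube per turn.

    Mutates state in place and returns the number of pages fetched.
    """
    for tag in tubes:
        state.setdefault(tag, {"page": 1, "exhausted": False})

    fetched = 0
    while fetched < pages_per_run:
        active = [t for t in tubes if not state[t]["exhausted"]]
        if not active:
            # Every catalog reached its end: reset for an incremental re-sweep.
            for t in tubes:
                state[t] = {"page": 1, "exhausted": False}
            break
        for tag in active:
            if fetched >= pages_per_run:
                break
            scenes = fetch_page(tag, state[tag]["page"])
            fetched += 1
            if scenes:
                # Real code would hand the scenes to the resolver here.
                state[tag]["page"] += 1
            else:
                # An empty page is treated as the end of this tube's catalog.
                state[tag]["exhausted"] = True
    return fetched
```

Keeping the cursor in a small JSON file is what lets the job resume between hourly runs without a schema change. The trade-off is that the file has to live on storage that survives restarts.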
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>