superporn hard-blocks the VPS IP with Cloudflare 403 on every TLS
impersonation, so HTML ingest routes through Bright Data residential
(BRIGHTDATA_PROXY_URL, parsed in config). First scraper to use a proxy:
optional _proxy on the browse base, threaded into browser_get.
JSON-LD VideoObject (title/desc/uploadDate/thumb/duration) + pornstar
and category chips; superporn double-encodes HTML entities so titles
are unescaped twice. Thumbnails fetch fine from the VPS (no proxy).
Playback stays off-proxy: the <source> mp4 token is IP-bound to the
fetcher, so resolve is phone-side via WebView (extractor superporncom
-> _vps_blocked_fallback), same as porndoe.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
porntrex/hqporner rejected for deep-crawl: KVS sites with no SSR metadata (77% of
existing porntrex has no duration -> invisible under the app's >=60 filter). eporner
instead exposes a public JSON API (api/v2/video/search) returning title + length_sec
+ keywords + added per video; ~100k videos, ~100/page, no per-scene detail fetch.
- BaseBrowseScraper.crawl_page(page): factored out of latest_scenes; returns None
(transient fail) / [] (catalog end) / [scenes]. API subclasses override it.
- deep_crawl drives via crawl_page (supports HTML-listing AND API sources).
- EpornerApiScraper: crawl_page hits the eporner API -> RawScene with duration+tags+
date+thumb+playback; registered in ALL_BROWSE_SCRAPERS.
- Pilot (2 API pages): 192 new, 100% playable + tagged + visible (>=60); the <180s
trailer filter dropped 6 short clips.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- paradisehill.fetch_movies compared release_date coerced to midnight against the
`since` timestamp, so the chronological crawl stopped at the first upload dated
the same calendar day as `since` and silently dropped most new movies (0-2 seen
per run; Movies tab stalled). Compare by DATE with a 1-day grace instead; idempotent
external_records upsert dedups the re-fetched recent window.
- scripts/backfill_paradisehill_movies.py: one-off no-delta deep crawl to recover the
backlog missed during the bug (idempotent, resumable).
- docs: correct stale 'raz dziennie/24h' browse-latest comments to 6h (4x/day), the
actual configured cadence (config.py sched_browse_latest_hours=6).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Goon — self-hosted aggregator for adult-content scene metadata.
Indexes scenes from TPDB, StashDB, and 30+ public adult tube sites.
Cross-source deduplication via perceptual hash + Levenshtein distance.
FastAPI backend + APScheduler worker + React Native (Expo) mobile client.
FOSS, ad-free, donation-funded. See README for details.