PlayTube CMS. Sitemap-based pagination (listing has no GET paging),
JSON-LD VideoObject metadata, pornstar/category pills, " Clips"
categories mapped to studio. Direct mp4 (cdnde.com/okcdn.ru), tokens
time-bound and portable cross-IP, so mobile plays direct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
xvideos renders the scene's models as `<a href="/models/slug">...<span class="name">
Display Name</span>...`. The old _MODEL_RE wanted text immediately after the anchor
`>` and never matched current markup → browse-scraped scenes landed with 0 performers
(bug-report 2026-06-07: "no actors, but Rebecca Johnson is on the page"). New regex
captures slug + nested span.name, bounded within the anchor. + backfill script for the
~11.9k existing zero-performer xvideos scenes (54% have a real /models/ link; resolver
merges names to canonical by name_normalized).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
hqporner search post-filter kept a scene if its slug contained ANY query token
(>=3 chars). For multi-word performer names this matched on a single common token
(e.g. "anna","mia"), so the performer-driven ingest attributed the scene to EVERY
performer sharing that token — scenes accumulated up to 503 wrong performers
(hqporner = 5659 of 5897 scenes with >30 performers; bug-reports 2026-06-07).
Switch ANY->ALL: the slug must contain every query token, requiring a full name
match before attribution. Single-word names still work. Precision over recall —
144 wrong performers is far worse than missing a few loose matches.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp" — a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service; it is
not — these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs +
external_records history stays linked. No behavior change — external_id keying
(sitetag:url) and dedup are unaffected.
NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.
Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
porndish-only scenes had no tags and no description — the scraper only derived a
title from the URL slug. The scene page (g1/bimber WP theme) carries both: a
<p class="entry-tags"> list of /video2/<slug>/ links (the "#" tags the user sees,
categories + co-performers) and a prose description <p> in .entry-content.
Override _fetch_scene_metadata in PornDishScraper to pull both from one page
fetch. Extend the base hook to accept an optional 4th return element
(description) and thread it into RawScene.description — backward compatible with
the existing 3-tuple (pornhat). Strips leading embed-button labels
("Video Player N", "Server N") from the prose. Verified on live scenes: clean
tag lists + real descriptions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
xvideos SSR's JSON-LD VideoObject (duration/title/uploadDate) + on-page /models/ (perf)
+ /tags/. Sample: median ~10.5min, 93% >=3min. Pilot (2 pages): 29 new, 100% playable +
visible + tagged (performers sparse — xvideos 'new' is amateur-heavy; /models/ tagged
mostly on studio rips).
- XVideosBrowseScraper (JSON-LD + page-parse models/tags), in ALL_BROWSE_SCRAPERS.
- deep_crawl._PAGE_CAP: per-sitetag depth cap; xvideoscom=1800 (~newest 50k). At the cap
the tube is marked exhausted (reset -> incremental re-sweep) so a mega-tube cannot
monopolize the round-robin or balloon the DB.
- ported yesporn.py into the public repo (was prod-only, like hdporngg) ending the
__init__ public/prod divergence.
youporn rejected: JSON-LD lacks actor/keywords, its /pornstar//category/ links are A-Z
nav not scene-specific. xhamster: 429/Cloudflare from the VPS IP.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
porntrex/hqporner rejected for deep-crawl: KVS sites with no SSR metadata (77% of
existing porntrex has no duration -> invisible under the app's >=60 filter). eporner
instead exposes a public JSON API (api/v2/video/search) returning title + length_sec
+ keywords + added per video; ~100k videos, ~100/page, no per-scene detail fetch.
- BaseBrowseScraper.crawl_page(page): factored out of latest_scenes; returns None
(transient fail) / [] (catalog end) / [scenes]. API subclasses override it.
- deep_crawl drives via crawl_page (supports HTML-listing AND API sources).
- EpornerApiScraper: crawl_page hits the eporner API -> RawScene with duration+tags+
date+thumb+playback; registered in ALL_BROWSE_SCRAPERS.
- Pilot (2 API pages): 192 new, 100% playable + tagged + visible (>=60); the <180s
trailer filter dropped 6 short clips.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- paradisehill.fetch_movies compared release_date coerced to midnight against the
`since` timestamp, so the chronological crawl stopped at the first upload dated
the same calendar day as `since` and silently dropped most new movies (0-2 seen
per run; Movies tab stalled). Compare by DATE with a 1-day grace instead; idempotent
external_records upsert dedups the re-fetched recent window.
- scripts/backfill_paradisehill_movies.py: one-off no-delta deep crawl to recover the
backlog missed during the bug (idempotent, resumable).
- docs: correct stale 'raz dziennie/24h' browse-latest comments to 6h (4x/day), the
actual configured cadence (config.py sched_browse_latest_hours=6).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolver/perf:
- find_by_phash_within: nearest match via Postgres bit_count over bit(64) XOR
instead of Python scan of all phash fingerprints (~20x faster per scene;
unblocks long delta runs that were killed mid-run before since advanced).
Scheduler/reliability:
- reap ingest_runs stuck in 'running' on worker startup (killed_by_restart).
- smoke_test: per-source ingest health, stuck-run and browse-freshness checks
-> Sentry; exclude killed_by_restart from the failed-run alarm.
Tags (ingest with tags + fill blanks):
- wire infer_tag_slugs into normalize_scene so tube scenes get title-inferred
tags (was dead code); union with connector tags.
- scripts/backfill_inferred_tags.py: keyset/batched/idempotent backfill for
existing tagless scenes (playable tag coverage 16% -> ~52%).
Clip-store:
- skip ManyVids/IWantClips/Clips4Sale/... from canonical sources at ingest
(GOON_SKIP_CLIP_STORE, default on) — permanent orphans, ~56% of canonical
ingest, never have a free-tube playback source.
Browse tubes:
- enable fullmovies + hdporn.gg: studio parsed from title prefix instead of
the /networks/ sidebar (which always yielded the first listed network);
drop phash compute (pilot: 0% canonical hit within Hamming 5 — auto-screenshots),
matching relies on title/performer/duration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bug-report 2026-05-30 "ingest znów się zawiesił". streamporn/pandamovies
wieszały się intermittentnie mid-run (zależnie od live-contentu danego dnia),
blokując sekwencyjny _job_movie_ingest → mangoporn (jedyny mirror z realnym
new-content: 72 nowych 05-28) nigdy nie startował. try/except chronił przed
wyjątkiem, NIE przed hangiem.
Fix:
- _job_movie_ingest: każdy connector w ThreadPoolExecutor z future.result
(timeout=360s). Hang jednego źródła → log + shutdown(wait=False) + kolejka
leci dalej. Healthy run ~50s, cap 6min = zapas.
- get_movie_connectors: reorder paradisehill, MANGOPORN, streamporn, pandamovies
— mangoporn zaraz po canonical primary, przed wolniejszymi/wieszającymi się.
Zweryfikowane: pełny _job_movie_ingest przeszedł wszystkie 4 success w nowej
kolejności (mangoporn 2nd, 23s). 33 osierocone "running" rows (worker ubity
mid-run przy deployach) wyczyszczone osobno.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bug-report 2026-05-28 ("od wczoraj nie ma nowych filmow"). DooplayConnector
.fetch_movies mial `while True` po stronach bez bound; streamporn (>2k filmow)
wisial godzinami az do dailowego killa schedulera, blokujac kolejke mangoporn
+ pandamovies. Watermark zamrozony, dziennie 0 nowych filmow.
Fix: cap _MAX_PAGES_DELTA=3 (since-driven runs, ~144 najnowszych pozycji)
i _MAX_PAGES_FULL=50 (full backfill gdy since=None). Wczesniejsza proba
filtrowania przez release_date odrzucona - release_date to data wydania filmu
(np. 2013), nie data uploadu na strone, wiec sortowanie listing nie matchuje.
Po deployu manualne re-run: streamporn 144/46s, pandamovies 120/47s,
mangoporn 108 z 72 NEW filmow w 58s. Scheduler queue unblocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
filemoon (+ mirrory kerapoxy/lvturbo/emturbovid/bysezoxexe/bysezejataos)
nie umarł — ~2026-05 zrobił rebrand na Vite SPA "Byse Frontend". Stary
P.A.C.K.E.R.-JWPlayer embed zniknął, więc backend uznał go za martwego i
wpisał na DEAD_HOSTER_RE. RE bundla index-ChwZgmXV.js (2026-05-22):
POST /api/videos/<code>/embed/playback body {"fingerprint":{}}
→ {"playback":{"key_parts":[..],"iv":..,"payload":..}}
→ key=concat(b64url(key_parts)); AES-256-GCM(key,iv,payload) → JSON
→ sources[*].url = HLS master.m3u8
Browser-attestation jest opcjonalny — pusty fingerprint wystarcza.
Stream URL jest IP-bound (token wiąże się z IP requestera), więc resolve
musi iść z urządzenia użytkownika (jak doodstream.ts / packerHoster.ts).
- mobile/src/lib/aesGcm.ts — pure-JS AES-256-GCM decrypt (RN/Hermes nie
ma Web Crypto); S-box liczony z GF(2^8), GHASH weryfikuje tag.
Zweryfikowane przeciw cryptography (Python) na 2 payloadach.
- mobile/src/lib/filemoonHoster.ts — resolver: POST playback → decrypt →
pick best source. E2E test: filemoon.to/e + /d + bysezoxexe.com mirror.
- PlayerScreen: filemoon w resolve useEffect obok doodstream/packer.
- backend: filemoon poza DEAD_HOSTER_RE; hoster.py early-return → przelot
jako type='hoster' do mobile resolvera (server-side resolve bezcelowy,
bo URL IP-bound do VPS).
- direct_scrapers: poprawiony błędny komentarz "filemoon shutdown".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Goon — self-hosted aggregator for adult-content scene metadata.
Indexes scenes from TPDB, StashDB, and 30+ public adult tube sites.
Cross-source deduplication via perceptual hash + Levenshtein distance.
FastAPI backend + APScheduler worker + React Native (Expo) mobile client.
FOSS, ad-free, donation-funded. See README for details.