Content moved to the w8.mypornerleak.com (wN) load-balancer subdomain, so
the old bare-domain scene regex matched nothing (frozen since 05-07).
Rewrote search() to scrape the canonical /actor/<slug>/ listing: scene
URL (wN host normalized to canonical for stable dedup), title, duration,
performers and category-tags from the <article> class (actors-*/category-*),
thumbnail. No studio (OnlyFans/amateur leaks have none). Multi-performer
works; playback unchanged (hoster, phone-side).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sxyland dropped the /<numeric_id>/<slug>/ scene URL format for /<slug>/,
so the old regex matched nothing (frozen since 06-07). Rewrote search()
to use the performer page /actor/<slug>/ and fetch each scene for full
metadata: all performers (with co-stars, from /actor/ links), tags
(scoped to the scene's tags-list, not the sidebar), duration + upload
date (itemprop), studio from the title prefix (BraZZers/MilfCoach/... ,
guarded so a performer-name prefix isn't mistaken for a studio). Junk
nav pages (Terms of Use etc.) are dropped via a no-duration-and-no-tags
guard. Verified: clean studio/performers/tags in DB, 0 errors.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sxyprn ingest was frozen since 05-07: the old ?type=videos&query= endpoint
returns trending (not performer-filtered), so the strict token filter
correctly dropped everything -> 0 ingest. Real "search" is the performer
page /<First-Last>.html. Rewrote search() to scrape those cards: clean
performer (the query, avoids sxyprn's Dallas/Rae name fragmentation),
studio (channel subcat), tags (#hashtags), duration, thumbnail. Token
filter now runs on the card title so only genuine matches attach the
performer. Verified: Lana Rhoades/Riley Reid/Angela White return results,
metadata persists in DB (studio e.g. Vixen, 10-31 tags/scene), playback
mp4 206.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User-report (mobilism): scenes are often poorly titled, so saved keyword queries are a useful extra retrieval strategy. New saved_searches table (device-scoped via X-Device-Id, unique per device+query, 50/device cap) + GET/POST/DELETE /saved-searches. Migration 0024. Verified CRUD on prod: add trims+dedups idempotently, empty rejected 422, delete idempotent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The global source monitor can't catch a single stalled tube because every tube scraper shares one Source row (tube-scraper), so an aggregate run still reports success while one origin freezes (freshporno browsing the rotating KVS homepage root, report 14f3a655). New watchdog checks max(created_at) per active browse-scraper origin (tube:<sitetag>); if a tube with history hasn't produced a new scene in > max_age_hours it fires a Sentry message with a stable per-origin fingerprint (age in extras, not the title, so it stays one grouped issue). Runs every 6h, 48h threshold, both env-tunable (GOON_SCHED_INGEST_WATCHDOG_HOURS / GOON_INGEST_WATCHDOG_MAX_AGE_HOURS). Verified: 0 stale at 48h post-fix, detects neporn at a strict 12h threshold.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The homepage root / is a KVS page with cache-control: no-store and a fresh PHPSESSID per request; the server rotates its featured block and on a cold session can serve an old set instead of the newest scenes. Result: browse-latest skipped everything for 3 days (root served 20 May content), no new freshporno scenes since 12 Jun (user report). Switch _listing_url to the explicit date-sorted /latest-updates/ feed (pagination /latest-updates/N/), which is not subject to that rotation.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The one-off cleanup merged ~13.5k same-video-different-title dupes, but they regrow as
these sibling tubes re-ingest under new titles. Wire the asset-id+duration merge into
the scheduler (every 12h, GOON_SCHED_THUMB_DEDUP_HOURS, 0=off) so it stays clean.
Shared logic lives in app/scheduler/thumb_dedup.py (run_thumb_asset_dedup); the one-shot
script now imports it. Same tight signature as the cleanup: family hosts only + identical
duration (the bare asset-id number is reused across unrelated CDNs, so cross-host/diff-
duration grouping is excluded). Reports 205b17d9 / 5a2944cb.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reverse-engineered the migrated 4k69 player: jwplayer now serves OK.ru CDN (okcdn.ru)
mp4s. The static page (SSR behind Cloudflare, fetched via proxy) carries "file"+"label"
pairs for every quality. okcdn's srcIp param is NOT enforced (cross-IP test 2026-06-14:
206 video/mp4 from a residential IP != srcIp), so the URL plays from any IP. Parse the
okcdn sources server-side and return them mobile_direct_ok — the phone plays the direct
video, no WebView, no VAST preroll, no age-gate, zero VPS proxy. Skips 4K/2K. Reverts
the brief _vps_blocked_fallback routing (WebView grabbed the preroll ad, not content).
Verified on emulator: native player streams the actual scene (report 5de3fbc5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
4k69 swapped its player from get_file (4kporno.xxx) to jwplayer + okcdn.ru, whose token
carries srcIp= (IP-bound); the site is also behind Cloudflare (VPS fetch only via proxy).
The native get_file extractor matched nothing and returned None, surfacing as a "host
problem" error even though the video plays fine (report 5de3fbc5). Switch 4k69com to
_vps_blocked_fallback: the on-device WebView (residential IP) clears Cloudflare, the
okcdn token binds to the phone IP, and INJECTED_JS hands the jwplayer source to ExoPlayer.
fourk69.extract stays in the module in case the site reverts to get_file.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
enrich-thumbnail was fill-only (skipped scenes that already had a thumbnail), so a
broken or stale preview (rotting sxyprn/trafficdeposit) could not be refreshed. Add a
force flag that re-fetches the source page and overwrites the existing thumbnail.
Backs the new "Refresh thumbnail" button (report d3376a71).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Porntrex soft-deletes: a removed video returns HTTP 200 with a "this video was deleted"
message instead of a player, so extract returned [] (transient) and the source was never
marked dead, leaving users on a permanently broken link (report 75dbf53e). Match the
deletion message and raise HosterDead so resolve marks the source dead.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Some sources (sexlikereal) build a giant `code`/`director` from a multi-performer
compilation title, overflowing scenes.code varchar(128) -> StringDataRightTruncation,
and the scene silently dropped from ingest. Cap both at the column limit in
_create_canonical and the fill path; code/director are stored metadata, not match keys,
so truncation is safe.
Fixes GOON-J
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reports were anonymous and one-way. Tie each report to the submitting device
(X-Device-Id), add an admin response back-channel, and let the app fetch replies for
its own device:
- migration 0023: bug_reports gains device_id, response, responded_at, response_seen.
- create_bug_report captures device_id.
- GET /bug-reports/mine (device-scoped) returns this device's reports + unseen count.
- POST /bug-reports/mine/seen clears the unseen flag.
- POST /bug-reports/{id}/reply sets the admin response (authored during triage).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A source (TPDB) returned a performer alias containing a literal U+0000 ("Ramon..").
Postgres cannot store in JSONB or text, so the external_records JSONB insert in
_upsert_external_record failed with UntranslatableCharacter and the scene never ingested
(GOON-Z). Recursively strip NUL from the raw payload (-> external_records.raw) and, when
present, also re-validate the RawScene/RawMovie so normalize -> typed text columns get
clean data too. Gated by a cheap _has_nul scan so clean records (the overwhelming
majority) pay no extra cost.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Time-bound HLS hosters whose manifest URL lacks a .m3u8 extension (e.g. pornhat's
"...mp4,?..." path) were mis-detected by ExoPlayer as progressive MP4 and failed,
forcing a full proxy fallback that streamed the whole video through the server. Serve
such manifests via /proxy/hls/<token>/play.m3u8 with child URLs left absolute on the
CDN, so the device fetches variant+segments directly and only the ~1KB manifest is
proxied. Routed only for mobile_direct_ok (time-bound) HLS without a .m3u8 path.
Also swallow httpx.TransportError in the stream proxy body generator: an upstream CDN
closing the connection mid-stream is benign (client just retries a range) and should
not surface as an unhandled error.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
It is an ISP proxy (static ISP IPs, flat billing), not residential —
so HTML-ingest bandwidth is free and the full deep-crawl is fine.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
superporn hard-blocks the VPS IP with Cloudflare 403 on every TLS
impersonation, so HTML ingest routes through Bright Data residential
(BRIGHTDATA_PROXY_URL, parsed in config). First scraper to use a proxy:
optional _proxy on the browse base, threaded into browser_get.
JSON-LD VideoObject (title/desc/uploadDate/thumb/duration) + pornstar
and category chips; superporn double-encodes HTML entities so titles
are unescaped twice. Thumbnails fetch fine from the VPS (no proxy).
Playback stays off-proxy: the <source> mp4 token is IP-bound to the
fetcher, so resolve is phone-side via WebView (extractor superporncom
-> _vps_blocked_fallback), same as porndoe.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
4k69.com (~65k scenes): same PlayTube CMS as hqfap - common logic moved
to _playtube.py (sitemap catalog, JSON-LD, pills). Studio classified by
matching category pills against the studios index page. Streams are
get_file (fullmovies family) returned unresolved with mobile_direct,
2160p skipped.
neporn.com: KVS engine, latest-updates listing, JSON-LD + video:duration
meta, performers from models links with flashvars video_tags fallback
for fresh uploads. Resolve via _kvs; final URL portable cross-IP.
superporn.com rejected: Cloudflare 403 from VPS on all TLS impersonations.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PlayTube CMS. Sitemap-based pagination (listing has no GET paging),
JSON-LD VideoObject metadata, pornstar/category pills, " Clips"
categories mapped to studio. Direct mp4 (cdnde.com/okcdn.ru), tokens
time-bound and portable cross-IP, so mobile plays direct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
trafficdeposit poster tokens live ~1h (hour-bucketed), so stored URLs can't persist.
New GET /proxy/sxyprn-thumb/{post_id}: resolves the current og:image from the live
/post/<id> page (cache resolved poster URL ~40min), streams bytes with Referer +
long client Cache-Control (URL is stable per post_id → client disk-caches the image,
backend fetches each post ~once). Deleted posts ("Post Not Found") → 404.
Scene grid now emits /proxy/sxyprn-thumb/<id> for sxyprn sources (derived from
page_url) instead of the dead stored trafficdeposit URL. Verified: live post → 200
image, deleted → 404, grid emits resolver URL. Backend-only, no OTA.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CORRECTION: trafficdeposit thumbnail tokens are hour-bucketed and valid only ~1h
(verified 2026-06-10: stored ts=11:00 dead at 12:27, current ts=13:00 loads). Earlier
"~weekly rot" read was wrong. Storing/periodically-refreshing sxyprn thumbnail URLs
is futile — they expire within the hour. Default the refresh job OFF (kept in code).
The dead-marking sweep (Post Not Found → dead_at) it performed was still valid. Live
sxyprn thumbnails need on-demand resolution at serve time (future work).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CORRECTION to earlier "unrecoverable" call: the /post/<id> page is alive (200) and
DOES expose the scene's own fresh-signed poster via og:image / <video poster>
(post-id embedded, current timestamp) — only the STORED thumbnail URL had rotted.
Search/listings don't re-surface old posts (0 overlap), but per-post fetch works.
scripts/refresh_sxyprn_thumbs.py: iterate live sxyprn sources, fetch post page,
extract fresh og:image, UPDATE thumbnail_url (verified: refreshed URLs return 200).
_job_refresh_sxyprn_thumbs: every 12h refresh the 1200 least-recently-updated sources
(cycles the ~19k catalog within the expiry window). Pairs with the scene_resolver
overwrite fix so refreshed thumbnails stick.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_upsert_playback_sources only set thumbnail_url when the existing value was NULL,
so signed CDN thumbnails that ROT (sxyprn/trafficdeposit tokens expire ~weekly →
404) were never replaced even when a fresh re-scrape captured a valid URL — making
the rot permanent (bug 2026-06-10). Always overwrite thumbnail_url/animated_thumbnail_url
with the freshly-scraped value when present; other fields keep fill-if-null. Lets
the regular performer-driven ingest self-heal thumbnails for re-crawled scenes.
(Note: old sxyprn backlog can't be bulk-refreshed — search/listings don't re-surface
those posts, verified 0 overlap — so it's forward-looking; old sxyprn-only scenes
fall back to the clean placeholder.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sxyprn thumbnails are time-signed on trafficdeposit CDN and ROT — the signed asset
404s after ~weeks and can't be re-signed/refreshed server-side (bug 2026-06-10,
~15k sxyprn-only scenes showed broken thumbs). In the light-list slim-thumbnail pick,
prefer a thumbnail from any non-trafficdeposit source; fall back to sxyprn only when
it's the scene's sole thumbnail (recent ones still load; dead ones now render a clean
placeholder client-side instead of a broken image).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Public instance has no accounts, so all user state was GLOBAL in DB — new users
saw/overwrote each other's (and Jan's) favorites, watched badges and blacklists
(bug 2026-06-10). Add device_id (VARCHAR 64) to 9 state tables with composite PK
(device_id, entity_id); app sends X-Device-Id header (get_device_id dep). All
favorites/scene-favorites/blacklist/watch + scene&movie list/detail (is_favorite,
watched, blacklist-hide) now filter by device. Existing rows backfilled to
'legacy-shared'; POST /me/adopt-legacy reassigns them to the caller once. Old
clients (no header) map to legacy-shared so they keep working until OTA updates.
Migration 0022: add col, backfill, composite PK. Verified on prod: 967 progress
rows preserved, device isolation holds (new device sees none of legacy state).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hold-to-preview gesture is being removed (did nothing useful), and no client
sends this filter. Remove the Query param, its EXISTS filter, and the pure-default
count guard reference.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
POST /scenes/{id}/hide — marks all playback_sources dead so the scene drops out
of has_playback lists (reversible via dead_at; row kept for dedup/refs).
POST /scenes/{keep_id}/merge/{drop_id} — merges drop into keep via scene_merge
(moves refs/performers/tags/fingerprints/playback). Backs the new tile long-press
menu (hide / mark-duplicate) replacing the dead animated-preview gesture.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
make_token embedded the current timestamp in the expiry, so every /scenes fetch
produced a DIFFERENT proxied URL for the same thumbnail → expo-image (keyed by URI)
cache-missed and re-downloaded every list load / app launch. Add stable_bucket_sec:
quantize the expiry base to a window so the URL is identical across requests.
_wrap_image_proxy uses a 7-day bucket → thumbnails disk-cache for a week instead of
re-fetching constantly. Answers "czy miniatury są cache'owane" — now yes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xhamster: move from WebView fallback to server-side native HLS. The scene page
is fetchable server-side and the xhcdn master m3u8 (variants + segments) is
time-bound, not IP-bound (verified cross-IP), so mobile plays the HLS direct
with zero proxy bandwidth. New tubes/xhamster.py pulls the master m3u8 from
SSR HTML and returns type='m3u8' mobile_direct; registry remaps xhamstercom
off _vps_blocked_fallback.
hqporner: add flyflv to the player-iframe host whitelist. hqporner rotated
some players to flyflv.com; the CDN host was already whitelisted but the iframe
host was not, so those scenes returned no stream.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bug-report 1a4bf258: "Re-scrape mógłby zniknąć, za to tagi/kategorie by mogły".
Re-scrape was a dev-only bulk thumbnail/tag enrich — noise on the performer page
(per-scene enrich already happens on SceneDetail). Removed it; kept Search.
New GET /performers/{id}/tags aggregates scene_tags across the performer's
live-playback scenes (top N). PerformerScenes renders them as chips → tap navigates
to TagScenes. Search button widened to full row.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A merged scene often aggregates several uploads from ONE tube (re-encodes / 4K
dups). bug-report aa79a995 "why 2 links, both porntrex?" = same scene std + 4K
(porntrex 2591377 + 2593449 "...in 4K"). In the UI these are indistinguishable
links to one hoster (same extractor). Keep one best per origin: prefer duration
matching the scene → any duration → first (origin-asc stable). Dead already filtered.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Chrome-DevTools investigation of bug-report 827a50a1 (sxyland "long loading,
then webview, no autoplay") showed sxyland embeds playmogo.com/e/<id> — a
DoodStream clone (doodcdn.io infra, pass_md5 protocol, get_slides) behind an
INVISIBLE Cloudflare Turnstile (not an interactive CAPTCHA; auto-passes in a
real browser/WebView from a residential IP). The sxyland page itself is NOT
Turnstile-gated — VPS curl pulls the playmogo iframe URL straight from the HTML.
sxylandcom was wired to _vps_blocked_fallback → phone loaded the entire sxyland
page in WebView (ads, click-to-play, no autoplay = the reported symptom), and the
playmogo embed never reached the phone's dood resolver. _embed_iframe (which
already lists sxyland in its docstring) extracts the playmogo embed and emits it
as type='hoster' → PlayerScreen routes playmogo URLs to doodstream.ts (resolveDoodStream),
which resolves phone-side (phone IP passes invisible Turnstile) → direct mp4 → autoplay.
Mobile unchanged (hoster→dood path already exists for xmoviesforyou/siska). Backend-only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
merge_scenes never reassigned playback_sources → ON DELETE CASCADE dropped them
with the absorbed scene. Cross-source (canonical) merges rarely had tube playback
so it hid, but tube-dup merges silently LOST playback links. Add _move_playback_sources
(global unique (origin,page_url) guarantees no collision on reassign).
+ merge_exact_title_duration.py: catches missing-merge dupes bulk_dedup misses
(same performer + identical normalized title + identical duration_sec, no phash).
Bad Bella had 25 such pairs (bug-report ef92809d "duplikat, te same miniatury").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xvideos renders the scene's models as `<a href="/models/slug">...<span class="name">
Display Name</span>...`. The old _MODEL_RE wanted text immediately after the anchor
`>` and never matched current markup → browse-scraped scenes landed with 0 performers
(bug-report 2026-06-07: "no actors, but Rebecca Johnson is on the page"). New regex
captures slug + nested span.name, bounded within the anchor. + backfill script for the
~11.9k existing zero-performer xvideos scenes (54% have a real /models/ link; resolver
merges names to canonical by name_normalized).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_candidate used OR logic (studio OR date±7d OR dur±30s) → 938,950 pairs;
Etap-2 scoring at ~110/s never finished in 1800s → bulk_dedup_performers HUNG
every run, orphan thread leaked until restart. Require AND: same studio plus
(date±2d OR dur±30s). 939k→16k pairs, full run 213s. Real cross-source dup of
one master shares studio + near date/duration; rare studio_id-mismatch pairs
skipped on purpose — a job that COMPLETES beats one that times out merging nothing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
hqporner search post-filter kept a scene if its slug contained ANY query token
(>=3 chars). For multi-word performer names this matched on a single common token
(e.g. "anna","mia"), so the performer-driven ingest attributed the scene to EVERY
performer sharing that token — scenes accumulated up to 503 wrong performers
(hqporner = 5659 of 5897 scenes with >30 performers; bug-reports 2026-06-07).
Switch ANY->ALL: the slug must contain every query token, requiring a full name
match before attribution. Single-word names still work. Precision over recall —
144 wrong performers is far worse than missing a few loose matches.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tag-filtered scene lists (e.g. blowjob + has_playback) took 4-12s. Root cause:
the filter joined scene_tags->tags on slug, so the actual tag_id was opaque to
the planner at plan time. It fell back to average per-tag cardinality
(8.4M/11541 ≈ 726) instead of the real 273k, chose to materialize ALL matching
scene_tags + check playback per row, then top-N sort.
Fix: resolve slug->tag_id in the app and filter on a LITERAL tag_id (no slug
join). With a constant, the planner uses MCV stats, knows the tag is huge, and
walks ix_scenes_created_at_desc probing scene_tags/playback per scene, stopping
at the page limit. Verified: blowjob list 3300ms -> 18ms (EXPLAIN), HTTP 4-12s ->
47ms. Unknown slug short-circuits to empty. (Pairs with the raised tag_id
statistics target so mid-tier tags also get correct estimates.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scene list returned the full SceneOut per item (nested tags/external_refs + all
playback_sources with page_url/embed/stream/quality) though SceneTile only reads
the thumbnail + title/duration/performer/studio, and SceneDetail re-fetches the
full scene via /scenes/{id}. Added light=True to _build_scenes_out_batch: skip the
tags + external_refs queries entirely and collapse playback_sources to one slim
entry (thumbnail_url + animated_thumbnail_url only).
Result: default list payload 78KB->48KB (-38%), ~28ms cached, less DB work per
list. Verified on emulator: grid thumbnails/durations/titles render unchanged.
No mobile change (tile reads the same fields); server-side, no OTA.
NOTE: the separate slow path — common-tag-filtered lists (4-12s; query expands all
matching scene_tags before sort/limit) — is structural (needs a denormalized
(tag_id, created_at) index) and deferred. VACUUM ANALYZE + raised tag_id stats
applied but the planner still can't avoid the materialization.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bug-report 2026-06-03 ("ten sam czas, ta sama miniaturka, czemu się nie mergują"):
duplicate scenes not merged at ingest. Exact phash alone is noisy here (95% are
collisions on shared thumbnails/intro frames — different scenes; bulk_dedup scorer
correctly gives 0 auto-merge). The safe subset is exact-phash AND same duration
(±3s) AND shared performer/title — near-certain same scene. Same-duration is key:
it excludes the false-merge pattern (short-clip-vs-full has DIFFERING durations).
- scripts/merge_phash_exact_dupes.py: one-off, dry-run by default, per-pair re-fetch
(handles clusters). Applied: 30 merged.
- bulk_dedup: add `_pairs_exact_phash` (SQL O(N log N), not the O(N²) Hamming scan)
+ strategy "phash_exact" — gated by the normal scorer (surfaces review candidates,
no risky auto-merge), schedulable for ongoing exact-collision review.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp" — a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service; it is
not — these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs +
external_records history stays linked. No behavior change — external_id keying
(sitetag:url) and dedup are unaffected.
NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.
Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two bug-report fixes (2026-06-07):
- sxyprn returns HTTP 200 "Post Not Found" for deleted posts (soft-404), so the
extractor returned None → resolve treated it as transient and never marked the
source dead, leaving a dead link offered forever. Now raise HosterDead on the
marker so resolve marks it dead.
- Scene playback sources were ordered alphabetically by origin, so a WebView-
fallback hoster (fpoxxx, IP-bound + ad-heavy) ranked above a working native
source (freshporno) on the same scene. Add is_vps_blocked_fallback() and sort
native-resolve origins ahead of WebView-fallback ones.
Verified on prod: sxyprn dead URL → HosterDead; scene sources reorder
freshpornoorg before fpoxxx.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_job_bulk_dedup_performers called run_bulk_dedup(strategy="performers") without
the cross_source_only guard whose docstring exists precisely to prevent this OOM.
At current catalog scale the unguarded path materializes N²/2 pairs per prolific
performer into a list → worker hit 6GB RSS and was OOM-killed every 12h (05:00/
17:00), taking down concurrent tpdb/stashdb/movie ingests as killed_by_restart
(0 new movies). Verified in prod: 05:00 run now completes (885k pairs scored, no
OOM) and ingests succeed (stashdb +241, tpdb +175).
Also wrap in _run_with_timeout like tpdb/stashdb (job had no hard-timeout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TPDB taxonomy emits numbered-duplicate tags (name "Bubble Butt2"); slugify
yields "bubble-butt2" (no separator before digit), so resolve_tag created a
separate tag alongside "bubble-butt". Tube scenes inherited the dup via
scene-merge → 75 pairs, ~10k scene_tags on the wrong tag.
- resolve_tag: canonicalize "<base>2" -> "<base>" when base exists (handles
current + future; trailing-"2"+alpha guard leaves milf-30/teen18 intact)
- scripts/merge_dup2_tags.py: one-off bulk merge (scene_tags + movie_tags +
blacklist) and taxonomy-count refresh
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
porndish-only scenes had no tags and no description — the scraper only derived a
title from the URL slug. The scene page (g1/bimber WP theme) carries both: a
<p class="entry-tags"> list of /video2/<slug>/ links (the "#" tags the user sees,
categories + co-performers) and a prose description <p> in .entry-content.
Override _fetch_scene_metadata in PornDishScraper to pull both from one page
fetch. Extend the base hook to accept an optional 4th return element
(description) and thread it into RawScene.description — backward compatible with
the existing 3-tuple (pornhat). Strips leading embed-button labels
("Video Player N", "Server N") from the prose. Verified on live scenes: clean
tag lists + real descriptions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Movies: the seekplayer-engine family (easyvidplayer/player4me/seekplayer/
embedseek/upns, ~322k sources) returns a time-bound master.m3u8 on a CDN with a
valid IP-SAN cert that plays cross-IP. Mark it mobile_direct in resolve, and make
MovieDetailScreen prefer direct_url with a proxy fallback (mirrors the scene
path) — previously every movie streamed through the VPS proxy. Paradisehill
multipart parts now go direct too. Device-verified: ExoPlayer plays the raw CDN
direct, zero proxy traffic, no flicker.
Scenes: the three blacklist NOT EXISTS clauses were appended to every filtered
list and evaluated per-row even when all blacklist tables are empty (~3.4s tax on
a deep mega-tag walk). Skip them when the tables are empty (cached check) —
mega-tag list 6.7s -> 3.3s, and every filtered list benefits.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
User: "hdporngg loading forever". DevTools + cross-IP investigation (not guessing):
- site is alive (sample scenes 200; the one earlier 404 was a single removed video,
not the site — my earlier "site dead" was a hasty generalization).
- both are the same platform (<source src=.../get_file/8512/...mp4>), no function/0.
- the get_file 302 is fast (~100ms) but the 2160p/4K source on fpvcdn.com TIMES OUT
(~30s); 720p/480p resolve in ~1s. The player loading 4K first = the "loading forever".
- the final fpvcdn URL embeds the requester IP (ip=<fetcher>) -> IP-bound to whoever
resolves it; BUT the get_file itself is stateless (fresh session works) and valid >=90s,
and binds fpvcdn to the fetcher. So a VPS resolve would bind to the VPS IP (mobile 403),
but returning the get_file URL UNRESOLVED lets the phone follow the 302 itself ->
fpvcdn binds to the phone IP -> plays.
Fix: new _source_getfile resolver returns get_file URLs as mobile_direct (skip 4K),
phone resolves the 302 in-session. Native, multi-quality, no WebView, no proxy.
Replaces fullmovies' old force_proxy+4K extractor and the WebView fallback for both.
Backend-verified: resolve -> 720/480 mobile_direct, get_file fresh fetch -> 206. Pending
on-device confirmation (emulator unstable; same mechanism as porn00/freshporno which work).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Same proper re-investigation as freshporno (DevTools + Bright Data residential
cross-IP + curl_cffi browser TLS). porn00's final CDN fe.porn00.org/...?token=&expires=
is PORTABLE cross-IP (token resolved from one residential IP replays 206 from a
different Bright Data residential IP) and only rejects non-browser TLS (plain curl
403, curl_cffi chrome 206). In #20 I tested the final URL with a standalone plain
curl, got 403, wrongly concluded "IP-bound" and left it on WebView (and before that
it used force_proxy, which violated the no-proxy stance).
porn00 flashvars are plain get_file (already decoded, no function/0 prefix), so
extend _kvs._URL_RE to match both forms — real_url passes plain URLs through
unchanged, _resolve_get_file follows the 302 in-session. porn00.py becomes a thin
_kvs wrapper. Verified no regression for the function/0 tubes (yespornvip/pornditt/
freshporno still resolve 3x mp4). Result: porn00 native multi-quality, mobile_direct,
zero proxy/WebView.
fpoxxx and pornxp were re-tested the same way and ARE genuinely IP-bound (403 from a
different residential IP — their token binds to the resolver IP), so they correctly
stay on the WebView fallback.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-investigated with the proper method (Chrome DevTools network capture + cross-IP
test via Bright Data residential proxy + curl_cffi browser-TLS) instead of guessing.
freshporno's real flow is get_file -> 302 -> cdn4.freshporno.org/remote_control.php
-> 206 video/mp4. The CDN URL is PORTABLE cross-IP (a token generated from one
residential IP replays fine from the VPS and from a different Bright Data residential
IP), it only rejects non-browser TLS fingerprints (plain curl -> 000, curl_cffi
chrome / ExoPlayer -> 206).
In #20 I tested the final URL with a standalone plain curl, got 000, and wrongly
concluded "unreachable from residential" -> kept it on the WebView fallback, which
barely worked (ad-heavy page, flaky). That false negative is the regression the user
reported. freshporno is function/0 KVS, so _kvs.resolve_kvs (which uses curl_cffi
chrome) already decodes + resolves it to a portable mp4 — switch to backend resolve
like yespornvip/pornditt: native, multi-quality, no proxy, no WebView.
Verified: backend resolve returns 3x mp4 (1080/720/480, mobile_direct) + cdn 206;
user confirmed native playback on device.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bug 19866e9e ("problem z oboma hosterami"): a scene whose only two sources were
fullmovies.xxx and hdporn.gg wouldn't play at all — neither had an entry in the
extractor registry, so try_extract returned None ("no stream"). fullmovies.xxx
serves a <source ...get_file...mp4> but the get_file CDN times out from the VPS
(unreachable, like freshporno), so backend resolve isn't viable; hdporn.gg sample
pages 404. Route both through the WebView fallback so the phone (residential IP)
loads the page and plays / the injected-JS scrape can grab the URL — strictly
better than no playback path. Surfaced by the hoster sweep + this bug report.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
xvideos SSR's JSON-LD VideoObject (duration/title/uploadDate) + on-page /models/ (perf)
+ /tags/. Sample: median ~10.5min, 93% >=3min. Pilot (2 pages): 29 new, 100% playable +
visible + tagged (performers sparse — xvideos 'new' is amateur-heavy; /models/ tagged
mostly on studio rips).
- XVideosBrowseScraper (JSON-LD + page-parse models/tags), in ALL_BROWSE_SCRAPERS.
- deep_crawl._PAGE_CAP: per-sitetag depth cap; xvideoscom=1800 (~newest 50k). At the cap
the tube is marked exhausted (reset -> incremental re-sweep) so a mega-tube cannot
monopolize the round-robin or balloon the DB.
- ported yesporn.py into the public repo (was prod-only, like hdporngg) ending the
__init__ public/prod divergence.
youporn rejected: JSON-LD lacks actor/keywords, its /pornstar//category/ links are A-Z
nav not scene-specific. xhamster: 429/Cloudflare from the VPS IP.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>