Movie detail showed ~100 playback links (report 41ca1fa4) because the 3 dooplay mirrors (mangoporn/pandamovies/streamporn) each record the SAME hoster embed as a separate row (e.g. luluvid/e/X from all three). Dedup by real target (embed_url/stream_url/page_url) after the priority sort, keeping the highest-priority copy — one verified movie drops 101 -> 58 unique.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
siska's ?s= search ignores the query (returns latest regardless), so the performer-driven search scraper always yielded 0 and was disabled. Rewrote SiskaScraper as a latest-browse scraper (BaseBrowseScraper, /page/<n>/) and moved it to ALL_BROWSE_SCRAPERS. The listing tile carries everything (no detail fetch): title, duration (MM:SS span), thumbnail (img data-src), performer + studio (img alt 'Performer - Title - Studio'), category (thumbnail path). Playback unchanged: fresh videos embed playmogo + luluvid, resolved phone-side via _embed_iframe. Verified ingest: 26 seen / 11 new / 15 updated / 0 errors — and 15 updated means siska scenes match existing canonical scenes, adding playback coverage rather than orphans. Now covered by the browse ingest-watchdog (48h) and the 6h browse-latest + deep-crawl jobs. Old self-player videos (player.siska.video -> cfglobalcdn, ~2018) are dead and age out.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Revisited siska re-enable (user fa4083a2). Findings: (1) fresh siska videos (videoID 227xxx) embed playmogo + luluvid and ARE phone-resolvable; updated siska.py scene regex + extractor path to the current video.php?videoID= format (old /<slug>/ format is gone). (2) BUT siska's ?s=<query> search is broken site-side — it returns the latest videos regardless of query (angela white == riley reid == homepage), so as a performer-driven BaseSearchScraper it always yields 0 (title token filter rejects everything). Reviving siska would require converting it to a browse/latest scraper (changes ingest character) — left as a decision. Old self-player videos (player.siska.video -> cfglobalcdn) are dead. Scraper stays disabled.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Missing-merge duplicates (same performer + identical normalized title + identical duration-to-the-second) that bulk_dedup misses — tube re-scrapes and cross-tube re-ingests like porn00 pulling a video already present from xnxx (reports 28fe8181/32df33b1). Extracted the proven merge_exact_title_duration logic into app/scheduler/title_duration_dedup.py (script now a thin wrapper), wired a 12h scheduler job (playback-only = what users actually see, GOON_SCHED_TITLE_DEDUP_HOURS). Signal is near-certain (two different videos don't share byte-identical title AND exact duration); no shared performer = not merged (over-match guard). Verified: job registers (jobs=14), backlog currently 0 after the one-shot global merge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scene_resolver._sync_tags used check-then-insert (select existing -> add if None), which races under concurrent ingest of the same scene: two runs both see existing=None, both add, flush -> IntegrityError pk_scene_tags (Sentry GOON-M, 4 events). Switched to pg_insert(...).on_conflict_do_nothing(index_elements=[scene_id, tag_id]) + in-batch dedup, identical to movie_resolver._sync_tags.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two observability additions to the worker scheduler (intertwined in the same files): (1) ingest-watchdog now also covers performer-driven search scrapers (ALL_DIRECT_SCRAPERS) with a separate 7d threshold, not just browse tubes at 48h — several search tubes (perverzija, fpoxxx, porndish, ...) had frozen silently for weeks. (2) New Hetzner Cloud bandwidth monitor (app/scheduler/hetzner_monitor.py): polls outgoing_traffic vs included_traffic and fires a Sentry message at info/warning/error % thresholds with a per-level fingerprint. The config fields existed for ages but the monitor was never implemented. No-op until HETZNER_API_TOKEN + HETZNER_SERVER_ID are set in .env (verified: returns {enabled: False}, job registers as 'hetzner-monitor every 6h', jobs=13).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Old regex matched junk (/wp-json etc.), not scenes (scenes are /<post_id>/).
Frozen since 06-13. Rewrote search() to scrape the /actor/<slug>/ listing
and parse <article> cards: scene URL, title, performers + tags from the
class (actors-*/tag-*/category-*, dropping performer-name fragment tags),
thumbnail. Studio + release date parsed from the "<Studio>-YYYY-MM-DD"
thumbnail filename, with a title-prefix "<Studio> YY MM DD" fallback.
Multi-performer works; no duration in listing; playback unchanged (hoster).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Content moved to the w8.mypornerleak.com (wN) load-balancer subdomain, so
the old bare-domain scene regex matched nothing (frozen since 05-07).
Rewrote search() to scrape the canonical /actor/<slug>/ listing: scene
URL (wN host normalized to canonical for stable dedup), title, duration,
performers and category-tags from the <article> class (actors-*/category-*),
thumbnail. No studio (OnlyFans/amateur leaks have none). Multi-performer
works; playback unchanged (hoster, phone-side).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sxyland dropped the /<numeric_id>/<slug>/ scene URL format for /<slug>/,
so the old regex matched nothing (frozen since 06-07). Rewrote search()
to use the performer page /actor/<slug>/ and fetch each scene for full
metadata: all performers (with co-stars, from /actor/ links), tags
(scoped to the scene's tags-list, not the sidebar), duration + upload
date (itemprop), studio from the title prefix (BraZZers/MilfCoach/... ,
guarded so a performer-name prefix isn't mistaken for a studio). Junk
nav pages (Terms of Use etc.) are dropped via a no-duration-and-no-tags
guard. Verified: clean studio/performers/tags in DB, 0 errors.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sxyprn ingest was frozen since 05-07: the old ?type=videos&query= endpoint
returns trending (not performer-filtered), so the strict token filter
correctly dropped everything -> 0 ingest. Real "search" is the performer
page /<First-Last>.html. Rewrote search() to scrape those cards: clean
performer (the query, avoids sxyprn's Dallas/Rae name fragmentation),
studio (channel subcat), tags (#hashtags), duration, thumbnail. Token
filter now runs on the card title so only genuine matches attach the
performer. Verified: Lana Rhoades/Riley Reid/Angela White return results,
metadata persists in DB (studio e.g. Vixen, 10-31 tags/scene), playback
mp4 206.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User-report (mobilism): scenes are often poorly titled, so saved keyword queries are a useful extra retrieval strategy. New saved_searches table (device-scoped via X-Device-Id, unique per device+query, 50/device cap) + GET/POST/DELETE /saved-searches. Migration 0024. Verified CRUD on prod: add trims+dedups idempotently, empty rejected 422, delete idempotent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The global source monitor can't catch a single stalled tube because every tube scraper shares one Source row (tube-scraper), so an aggregate run still reports success while one origin freezes (freshporno browsing the rotating KVS homepage root, report 14f3a655). New watchdog checks max(created_at) per active browse-scraper origin (tube:<sitetag>); if a tube with history hasn't produced a new scene in > max_age_hours it fires a Sentry message with a stable per-origin fingerprint (age in extras, not the title, so it stays one grouped issue). Runs every 6h, 48h threshold, both env-tunable (GOON_SCHED_INGEST_WATCHDOG_HOURS / GOON_INGEST_WATCHDOG_MAX_AGE_HOURS). Verified: 0 stale at 48h post-fix, detects neporn at a strict 12h threshold.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The homepage root / is a KVS page with cache-control: no-store and a fresh PHPSESSID per request; the server rotates its featured block and on a cold session can serve an old set instead of the newest scenes. Result: browse-latest skipped everything for 3 days (root served 20 May content), no new freshporno scenes since 12 Jun (user report). Switch _listing_url to the explicit date-sorted /latest-updates/ feed (pagination /latest-updates/N/), which is not subject to that rotation.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The one-off cleanup merged ~13.5k same-video-different-title dupes, but they regrow as
these sibling tubes re-ingest under new titles. Wire the asset-id+duration merge into
the scheduler (every 12h, GOON_SCHED_THUMB_DEDUP_HOURS, 0=off) so it stays clean.
Shared logic lives in app/scheduler/thumb_dedup.py (run_thumb_asset_dedup); the one-shot
script now imports it. Same tight signature as the cleanup: family hosts only + identical
duration (the bare asset-id number is reused across unrelated CDNs, so cross-host/diff-
duration grouping is excluded). Reports 205b17d9 / 5a2944cb.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reverse-engineered the migrated 4k69 player: jwplayer now serves OK.ru CDN (okcdn.ru)
mp4s. The static page (SSR behind Cloudflare, fetched via proxy) carries "file"+"label"
pairs for every quality. okcdn's srcIp param is NOT enforced (cross-IP test 2026-06-14:
206 video/mp4 from a residential IP != srcIp), so the URL plays from any IP. Parse the
okcdn sources server-side and return them mobile_direct_ok — the phone plays the direct
video, no WebView, no VAST preroll, no age-gate, zero VPS proxy. Skips 4K/2K. Reverts
the brief _vps_blocked_fallback routing (WebView grabbed the preroll ad, not content).
Verified on emulator: native player streams the actual scene (report 5de3fbc5).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
4k69 swapped its player from get_file (4kporno.xxx) to jwplayer + okcdn.ru, whose token
carries srcIp= (IP-bound); the site is also behind Cloudflare (VPS fetch only via proxy).
The native get_file extractor matched nothing and returned None, surfacing as a "host
problem" error even though the video plays fine (report 5de3fbc5). Switch 4k69com to
_vps_blocked_fallback: the on-device WebView (residential IP) clears Cloudflare, the
okcdn token binds to the phone IP, and INJECTED_JS hands the jwplayer source to ExoPlayer.
fourk69.extract stays in the module in case the site reverts to get_file.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
enrich-thumbnail was fill-only (skipped scenes that already had a thumbnail), so a
broken or stale preview (rotting sxyprn/trafficdeposit) could not be refreshed. Add a
force flag that re-fetches the source page and overwrites the existing thumbnail.
Backs the new "Refresh thumbnail" button (report d3376a71).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Porntrex soft-deletes: a removed video returns HTTP 200 with a "this video was deleted"
message instead of a player, so extract returned [] (transient) and the source was never
marked dead, leaving users on a permanently broken link (report 75dbf53e). Match the
deletion message and raise HosterDead so resolve marks the source dead.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Some sources (sexlikereal) build a giant `code`/`director` from a multi-performer
compilation title, overflowing scenes.code varchar(128) -> StringDataRightTruncation,
and the scene silently dropped from ingest. Cap both at the column limit in
_create_canonical and the fill path; code/director are stored metadata, not match keys,
so truncation is safe.
Fixes GOON-J
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reports were anonymous and one-way. Tie each report to the submitting device
(X-Device-Id), add an admin response back-channel, and let the app fetch replies for
its own device:
- migration 0023: bug_reports gains device_id, response, responded_at, response_seen.
- create_bug_report captures device_id.
- GET /bug-reports/mine (device-scoped) returns this device's reports + unseen count.
- POST /bug-reports/mine/seen clears the unseen flag.
- POST /bug-reports/{id}/reply sets the admin response (authored during triage).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A source (TPDB) returned a performer alias containing a literal U+0000 ("Ramon..").
Postgres cannot store in JSONB or text, so the external_records JSONB insert in
_upsert_external_record failed with UntranslatableCharacter and the scene never ingested
(GOON-Z). Recursively strip NUL from the raw payload (-> external_records.raw) and, when
present, also re-validate the RawScene/RawMovie so normalize -> typed text columns get
clean data too. Gated by a cheap _has_nul scan so clean records (the overwhelming
majority) pay no extra cost.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Time-bound HLS hosters whose manifest URL lacks a .m3u8 extension (e.g. pornhat's
"...mp4,?..." path) were mis-detected by ExoPlayer as progressive MP4 and failed,
forcing a full proxy fallback that streamed the whole video through the server. Serve
such manifests via /proxy/hls/<token>/play.m3u8 with child URLs left absolute on the
CDN, so the device fetches variant+segments directly and only the ~1KB manifest is
proxied. Routed only for mobile_direct_ok (time-bound) HLS without a .m3u8 path.
Also swallow httpx.TransportError in the stream proxy body generator: an upstream CDN
closing the connection mid-stream is benign (client just retries a range) and should
not surface as an unhandled error.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
It is an ISP proxy (static ISP IPs, flat billing), not residential —
so HTML-ingest bandwidth is free and the full deep-crawl is fine.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
superporn hard-blocks the VPS IP with Cloudflare 403 on every TLS
impersonation, so HTML ingest routes through Bright Data residential
(BRIGHTDATA_PROXY_URL, parsed in config). First scraper to use a proxy:
optional _proxy on the browse base, threaded into browser_get.
JSON-LD VideoObject (title/desc/uploadDate/thumb/duration) + pornstar
and category chips; superporn double-encodes HTML entities so titles
are unescaped twice. Thumbnails fetch fine from the VPS (no proxy).
Playback stays off-proxy: the <source> mp4 token is IP-bound to the
fetcher, so resolve is phone-side via WebView (extractor superporncom
-> _vps_blocked_fallback), same as porndoe.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
4k69.com (~65k scenes): same PlayTube CMS as hqfap - common logic moved
to _playtube.py (sitemap catalog, JSON-LD, pills). Studio classified by
matching category pills against the studios index page. Streams are
get_file (fullmovies family) returned unresolved with mobile_direct,
2160p skipped.
neporn.com: KVS engine, latest-updates listing, JSON-LD + video:duration
meta, performers from models links with flashvars video_tags fallback
for fresh uploads. Resolve via _kvs; final URL portable cross-IP.
superporn.com rejected: Cloudflare 403 from VPS on all TLS impersonations.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PlayTube CMS. Sitemap-based pagination (listing has no GET paging),
JSON-LD VideoObject metadata, pornstar/category pills, " Clips"
categories mapped to studio. Direct mp4 (cdnde.com/okcdn.ru), tokens
time-bound and portable cross-IP, so mobile plays direct.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
trafficdeposit poster tokens live ~1h (hour-bucketed), so stored URLs can't persist.
New GET /proxy/sxyprn-thumb/{post_id}: resolves the current og:image from the live
/post/<id> page (cache resolved poster URL ~40min), streams bytes with Referer +
long client Cache-Control (URL is stable per post_id → client disk-caches the image,
backend fetches each post ~once). Deleted posts ("Post Not Found") → 404.
Scene grid now emits /proxy/sxyprn-thumb/<id> for sxyprn sources (derived from
page_url) instead of the dead stored trafficdeposit URL. Verified: live post → 200
image, deleted → 404, grid emits resolver URL. Backend-only, no OTA.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CORRECTION: trafficdeposit thumbnail tokens are hour-bucketed and valid only ~1h
(verified 2026-06-10: stored ts=11:00 dead at 12:27, current ts=13:00 loads). Earlier
"~weekly rot" read was wrong. Storing/periodically-refreshing sxyprn thumbnail URLs
is futile — they expire within the hour. Default the refresh job OFF (kept in code).
The dead-marking sweep (Post Not Found → dead_at) it performed was still valid. Live
sxyprn thumbnails need on-demand resolution at serve time (future work).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CORRECTION to earlier "unrecoverable" call: the /post/<id> page is alive (200) and
DOES expose the scene's own fresh-signed poster via og:image / <video poster>
(post-id embedded, current timestamp) — only the STORED thumbnail URL had rotted.
Search/listings don't re-surface old posts (0 overlap), but per-post fetch works.
scripts/refresh_sxyprn_thumbs.py: iterate live sxyprn sources, fetch post page,
extract fresh og:image, UPDATE thumbnail_url (verified: refreshed URLs return 200).
_job_refresh_sxyprn_thumbs: every 12h refresh the 1200 least-recently-updated sources
(cycles the ~19k catalog within the expiry window). Pairs with the scene_resolver
overwrite fix so refreshed thumbnails stick.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_upsert_playback_sources only set thumbnail_url when the existing value was NULL,
so signed CDN thumbnails that ROT (sxyprn/trafficdeposit tokens expire ~weekly →
404) were never replaced even when a fresh re-scrape captured a valid URL — making
the rot permanent (bug 2026-06-10). Always overwrite thumbnail_url/animated_thumbnail_url
with the freshly-scraped value when present; other fields keep fill-if-null. Lets
the regular performer-driven ingest self-heal thumbnails for re-crawled scenes.
(Note: old sxyprn backlog can't be bulk-refreshed — search/listings don't re-surface
those posts, verified 0 overlap — so it's forward-looking; old sxyprn-only scenes
fall back to the clean placeholder.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sxyprn thumbnails are time-signed on trafficdeposit CDN and ROT — the signed asset
404s after ~weeks and can't be re-signed/refreshed server-side (bug 2026-06-10,
~15k sxyprn-only scenes showed broken thumbs). In the light-list slim-thumbnail pick,
prefer a thumbnail from any non-trafficdeposit source; fall back to sxyprn only when
it's the scene's sole thumbnail (recent ones still load; dead ones now render a clean
placeholder client-side instead of a broken image).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Public instance has no accounts, so all user state was GLOBAL in DB — new users
saw/overwrote each other's (and Jan's) favorites, watched badges and blacklists
(bug 2026-06-10). Add device_id (VARCHAR 64) to 9 state tables with composite PK
(device_id, entity_id); app sends X-Device-Id header (get_device_id dep). All
favorites/scene-favorites/blacklist/watch + scene&movie list/detail (is_favorite,
watched, blacklist-hide) now filter by device. Existing rows backfilled to
'legacy-shared'; POST /me/adopt-legacy reassigns them to the caller once. Old
clients (no header) map to legacy-shared so they keep working until OTA updates.
Migration 0022: add col, backfill, composite PK. Verified on prod: 967 progress
rows preserved, device isolation holds (new device sees none of legacy state).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hold-to-preview gesture is being removed (did nothing useful), and no client
sends this filter. Remove the Query param, its EXISTS filter, and the pure-default
count guard reference.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
POST /scenes/{id}/hide — marks all playback_sources dead so the scene drops out
of has_playback lists (reversible via dead_at; row kept for dedup/refs).
POST /scenes/{keep_id}/merge/{drop_id} — merges drop into keep via scene_merge
(moves refs/performers/tags/fingerprints/playback). Backs the new tile long-press
menu (hide / mark-duplicate) replacing the dead animated-preview gesture.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
make_token embedded the current timestamp in the expiry, so every /scenes fetch
produced a DIFFERENT proxied URL for the same thumbnail → expo-image (keyed by URI)
cache-missed and re-downloaded every list load / app launch. Add stable_bucket_sec:
quantize the expiry base to a window so the URL is identical across requests.
_wrap_image_proxy uses a 7-day bucket → thumbnails disk-cache for a week instead of
re-fetching constantly. Answers "czy miniatury są cache'owane" — now yes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xhamster: move from WebView fallback to server-side native HLS. The scene page
is fetchable server-side and the xhcdn master m3u8 (variants + segments) is
time-bound, not IP-bound (verified cross-IP), so mobile plays the HLS direct
with zero proxy bandwidth. New tubes/xhamster.py pulls the master m3u8 from
SSR HTML and returns type='m3u8' mobile_direct; registry remaps xhamstercom
off _vps_blocked_fallback.
hqporner: add flyflv to the player-iframe host whitelist. hqporner rotated
some players to flyflv.com; the CDN host was already whitelisted but the iframe
host was not, so those scenes returned no stream.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bug-report 1a4bf258: "Re-scrape mógłby zniknąć, za to tagi/kategorie by mogły".
Re-scrape was a dev-only bulk thumbnail/tag enrich — noise on the performer page
(per-scene enrich already happens on SceneDetail). Removed it; kept Search.
New GET /performers/{id}/tags aggregates scene_tags across the performer's
live-playback scenes (top N). PerformerScenes renders them as chips → tap navigates
to TagScenes. Search button widened to full row.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A merged scene often aggregates several uploads from ONE tube (re-encodes / 4K
dups). bug-report aa79a995 "why 2 links, both porntrex?" = same scene std + 4K
(porntrex 2591377 + 2593449 "...in 4K"). In the UI these are indistinguishable
links to one hoster (same extractor). Keep one best per origin: prefer duration
matching the scene → any duration → first (origin-asc stable). Dead already filtered.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Chrome-DevTools investigation of bug-report 827a50a1 (sxyland "long loading,
then webview, no autoplay") showed sxyland embeds playmogo.com/e/<id> — a
DoodStream clone (doodcdn.io infra, pass_md5 protocol, get_slides) behind an
INVISIBLE Cloudflare Turnstile (not an interactive CAPTCHA; auto-passes in a
real browser/WebView from a residential IP). The sxyland page itself is NOT
Turnstile-gated — VPS curl pulls the playmogo iframe URL straight from the HTML.
sxylandcom was wired to _vps_blocked_fallback → phone loaded the entire sxyland
page in WebView (ads, click-to-play, no autoplay = the reported symptom), and the
playmogo embed never reached the phone's dood resolver. _embed_iframe (which
already lists sxyland in its docstring) extracts the playmogo embed and emits it
as type='hoster' → PlayerScreen routes playmogo URLs to doodstream.ts (resolveDoodStream),
which resolves phone-side (phone IP passes invisible Turnstile) → direct mp4 → autoplay.
Mobile unchanged (hoster→dood path already exists for xmoviesforyou/siska). Backend-only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
merge_scenes never reassigned playback_sources → ON DELETE CASCADE dropped them
with the absorbed scene. Cross-source (canonical) merges rarely had tube playback
so it hid, but tube-dup merges silently LOST playback links. Add _move_playback_sources
(global unique (origin,page_url) guarantees no collision on reassign).
+ merge_exact_title_duration.py: catches missing-merge dupes bulk_dedup misses
(same performer + identical normalized title + identical duration_sec, no phash).
Bad Bella had 25 such pairs (bug-report ef92809d "duplikat, te same miniatury").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xvideos renders the scene's models as `<a href="/models/slug">...<span class="name">
Display Name</span>...`. The old _MODEL_RE wanted text immediately after the anchor
`>` and never matched current markup → browse-scraped scenes landed with 0 performers
(bug-report 2026-06-07: "no actors, but Rebecca Johnson is on the page"). New regex
captures slug + nested span.name, bounded within the anchor. + backfill script for the
~11.9k existing zero-performer xvideos scenes (54% have a real /models/ link; resolver
merges names to canonical by name_normalized).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_candidate used OR logic (studio OR date±7d OR dur±30s) → 938,950 pairs;
Etap-2 scoring at ~110/s never finished in 1800s → bulk_dedup_performers HUNG
every run, orphan thread leaked until restart. Require AND: same studio plus
(date±2d OR dur±30s). 939k→16k pairs, full run 213s. Real cross-source dup of
one master shares studio + near date/duration; rare studio_id-mismatch pairs
skipped on purpose — a job that COMPLETES beats one that times out merging nothing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
hqporner search post-filter kept a scene if its slug contained ANY query token
(>=3 chars). For multi-word performer names this matched on a single common token
(e.g. "anna","mia"), so the performer-driven ingest attributed the scene to EVERY
performer sharing that token — scenes accumulated up to 503 wrong performers
(hqporner = 5659 of 5897 scenes with >30 performers; bug-reports 2026-06-07).
Switch ANY->ALL: the slug must contain every query token, requiring a full name
match before attribution. Single-word names still work. Precision over recall —
144 wrong performers is far worse than missing a few loose matches.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tag-filtered scene lists (e.g. blowjob + has_playback) took 4-12s. Root cause:
the filter joined scene_tags->tags on slug, so the actual tag_id was opaque to
the planner at plan time. It fell back to average per-tag cardinality
(8.4M/11541 ≈ 726) instead of the real 273k, chose to materialize ALL matching
scene_tags + check playback per row, then top-N sort.
Fix: resolve slug->tag_id in the app and filter on a LITERAL tag_id (no slug
join). With a constant, the planner uses MCV stats, knows the tag is huge, and
walks ix_scenes_created_at_desc probing scene_tags/playback per scene, stopping
at the page limit. Verified: blowjob list 3300ms -> 18ms (EXPLAIN), HTTP 4-12s ->
47ms. Unknown slug short-circuits to empty. (Pairs with the raised tag_id
statistics target so mid-tier tags also get correct estimates.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scene list returned the full SceneOut per item (nested tags/external_refs + all
playback_sources with page_url/embed/stream/quality) though SceneTile only reads
the thumbnail + title/duration/performer/studio, and SceneDetail re-fetches the
full scene via /scenes/{id}. Added light=True to _build_scenes_out_batch: skip the
tags + external_refs queries entirely and collapse playback_sources to one slim
entry (thumbnail_url + animated_thumbnail_url only).
Result: default list payload 78KB->48KB (-38%), ~28ms cached, less DB work per
list. Verified on emulator: grid thumbnails/durations/titles render unchanged.
No mobile change (tile reads the same fields); server-side, no OTA.
NOTE: the separate slow path — common-tag-filtered lists (4-12s; query expands all
matching scene_tags before sort/limit) — is structural (needs a denormalized
(tag_id, created_at) index) and deferred. VACUUM ANALYZE + raised tag_id stats
applied but the planner still can't avoid the materialization.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bug-report 2026-06-03 ("ten sam czas, ta sama miniaturka, czemu się nie mergują"):
duplicate scenes not merged at ingest. Exact phash alone is noisy here (95% are
collisions on shared thumbnails/intro frames — different scenes; bulk_dedup scorer
correctly gives 0 auto-merge). The safe subset is exact-phash AND same duration
(±3s) AND shared performer/title — near-certain same scene. Same-duration is key:
it excludes the false-merge pattern (short-clip-vs-full has DIFFERING durations).
- scripts/merge_phash_exact_dupes.py: one-off, dry-run by default, per-pair re-fetch
(handles clusters). Applied: 30 merged.
- bulk_dedup: add `_pairs_exact_phash` (SQL O(N log N), not the O(N²) Hamming scan)
+ strategy "phash_exact" — gated by the normal scorer (surfaces review candidates,
no risky auto-merge), schedulable for ongoing exact-collision review.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp" — a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service; it is
not — these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs +
external_records history stays linked. No behavior change — external_id keying
(sitetag:url) and dedup are unaffected.
NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.
Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two bug-report fixes (2026-06-07):
- sxyprn returns HTTP 200 "Post Not Found" for deleted posts (soft-404), so the
extractor returned None → resolve treated it as transient and never marked the
source dead, leaving a dead link offered forever. Now raise HosterDead on the
marker so resolve marks it dead.
- Scene playback sources were ordered alphabetically by origin, so a WebView-
fallback hoster (fpoxxx, IP-bound + ad-heavy) ranked above a working native
source (freshporno) on the same scene. Add is_vps_blocked_fallback() and sort
native-resolve origins ahead of WebView-fallback ones.
Verified on prod: sxyprn dead URL → HosterDead; scene sources reorder
freshpornoorg before fpoxxx.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_job_bulk_dedup_performers called run_bulk_dedup(strategy="performers") without
the cross_source_only guard whose docstring exists precisely to prevent this OOM.
At current catalog scale the unguarded path materializes N²/2 pairs per prolific
performer into a list → worker hit 6GB RSS and was OOM-killed every 12h (05:00/
17:00), taking down concurrent tpdb/stashdb/movie ingests as killed_by_restart
(0 new movies). Verified in prod: 05:00 run now completes (885k pairs scored, no
OOM) and ingests succeed (stashdb +241, tpdb +175).
Also wrap in _run_with_timeout like tpdb/stashdb (job had no hard-timeout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TPDB taxonomy emits numbered-duplicate tags (name "Bubble Butt2"); slugify
yields "bubble-butt2" (no separator before digit), so resolve_tag created a
separate tag alongside "bubble-butt". Tube scenes inherited the dup via
scene-merge → 75 pairs, ~10k scene_tags on the wrong tag.
- resolve_tag: canonicalize "<base>2" -> "<base>" when base exists (handles
current + future; trailing-"2"+alpha guard leaves milf-30/teen18 intact)
- scripts/merge_dup2_tags.py: one-off bulk merge (scene_tags + movie_tags +
blacklist) and taxonomy-count refresh
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>