Commit graph

82 commits

Author SHA1 Message Date
jtrzupek
e0e69189a8 fix(sxyprn): revive search via performer pages + rich metadata
sxyprn ingest was frozen since 05-07: the old ?type=videos&query= endpoint
returns trending (not performer-filtered), so the strict token filter
correctly dropped everything -> 0 ingest. Real "search" is the performer
page /<First-Last>.html. Rewrote search() to scrape those cards: clean
performer (the query, avoids sxyprn's Dallas/Rae name fragmentation),
studio (channel subcat), tags (#hashtags), duration, thumbnail. Token
filter now runs on the card title so only genuine matches attach the
performer. Verified: Lana Rhoades/Riley Reid/Angela White return results,
metadata persists in DB (studio e.g. Vixen, 10-31 tags/scene), playback
mp4 206.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 22:58:52 +02:00
jtrzupek
bcee5851e9 feat(api): per-device saved searches (keyword favorites)
User-report (mobilism): scenes are often poorly titled, so saved keyword queries are a useful extra retrieval strategy. New saved_searches table (device-scoped via X-Device-Id, unique per device+query, 50/device cap) + GET/POST/DELETE /saved-searches. Migration 0024. Verified CRUD on prod: add trims+dedups idempotently, empty rejected 422, delete idempotent.
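
A minimal in-memory sketch of the add/delete semantics described above (trim, per-device dedup, empty rejected, idempotent delete). The class and exception names are hypothetical, and how the real endpoint responds once the 50-per-device cap is reached is an assumption; the actual implementation is a database table behind GET/POST/DELETE /saved-searches.

```python
MAX_PER_DEVICE = 50  # cap stated in the commit message


class LimitReached(Exception):
    """Hypothetical error for a device that already holds MAX_PER_DEVICE queries."""


class SavedSearches:
    """In-memory stand-in for the saved_searches table (illustration only)."""

    def __init__(self):
        self._by_device = {}

    def add(self, device_id, query):
        q = query.strip()
        if not q:
            # The API rejects this case with HTTP 422.
            raise ValueError("empty query")
        items = self._by_device.setdefault(device_id, [])
        if q in items:
            return q  # idempotent: unique per (device, query)
        if len(items) >= MAX_PER_DEVICE:
            raise LimitReached(device_id)  # assumption: the cap rejects the add
        items.append(q)
        return q

    def delete(self, device_id, query):
        items = self._by_device.get(device_id, [])
        q = query.strip()
        if q in items:
            items.remove(q)  # deleting a missing query is a no-op

    def queries(self, device_id):
        return list(self._by_device.get(device_id, []))
```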

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 13:52:18 +02:00
jtrzupek
0424cb9138 feat(scheduler): per-origin ingest freshness watchdog -> Sentry
The global source monitor can't catch a single stalled tube because every tube scraper shares one Source row (tube-scraper), so an aggregate run still reports success while one origin freezes (freshporno browsing the rotating KVS homepage root, report 14f3a655). New watchdog checks max(created_at) per active browse-scraper origin (tube:<sitetag>); if a tube with history hasn't produced a new scene in > max_age_hours it fires a Sentry message with a stable per-origin fingerprint (age in extras, not the title, so it stays one grouped issue). Runs every 6h, 48h threshold, both env-tunable (GOON_SCHED_INGEST_WATCHDOG_HOURS / GOON_INGEST_WATCHDOG_MAX_AGE_HOURS). Verified: 0 stale at 48h post-fix, detects neporn at a strict 12h threshold.
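
The staleness check can be sketched roughly as follows, assuming the per-origin max(created_at) values have already been queried into a dict. The function name, the message text and the event dict layout are hypothetical; only the 48h default, the per-origin fingerprint and "age in extras" come from the commit message.

```python
from datetime import datetime, timedelta, timezone


def find_stale_origins(latest_by_origin, now, max_age_hours=48):
    """Return one alert event per origin whose newest scene is too old.

    latest_by_origin maps an origin such as "tube:<sitetag>" to its
    max(created_at); origins without history are simply absent.
    """
    limit = timedelta(hours=max_age_hours)
    events = []
    for origin, latest in sorted(latest_by_origin.items()):
        age = now - latest
        if age > limit:
            events.append({
                # Title and fingerprint stay constant per origin so repeated
                # alerts group into a single issue.
                "message": f"ingest stale: {origin}",
                "fingerprint": ["ingest-watchdog", origin],
                # The changing age goes into extras, not the title.
                "extras": {"age_hours": round(age.total_seconds() / 3600, 1)},
            })
    return events
```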

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 10:26:25 +02:00
jtrzupek
4b71689a95 fix(scrapers): freshporno browse from /latest-updates/ not homepage root
The homepage root / is a KVS page with cache-control: no-store and a fresh PHPSESSID per request; the server rotates its featured block and on a cold session can serve an old set instead of the newest scenes. Result: browse-latest skipped everything for 3 days (root served 20 May content), no new freshporno scenes since 12 Jun (user report). Switch _listing_url to the explicit date-sorted /latest-updates/ feed (pagination /latest-updates/N/), which is not subject to that rotation.
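
For illustration, the URL scheme described above could be built like this. It is only a sketch: the function name and the example host are placeholders, and treating page 1 as the bare /latest-updates/ path is an assumption.

```python
def listing_url(base, page=1):
    """Build the date-sorted feed URL: /latest-updates/ and /latest-updates/N/."""
    if page < 1:
        raise ValueError("page must be >= 1")
    root = base.rstrip("/") + "/latest-updates/"
    # Assumption: the first page has no numeric suffix.
    return root if page == 1 else f"{root}{page}/"
```

For example, listing_url("https://tube.example", 3) yields "https://tube.example/latest-updates/3/".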

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 09:59:40 +02:00
jtrzupek
8b4783771f feat(scheduler): periodic thumb-asset dedup (hdporn.gg/fullmovies.xxx)
The one-off cleanup merged ~13.5k same-video-different-title dupes, but they regrow as
these sibling tubes re-ingest under new titles. Wire the asset-id+duration merge into
the scheduler (every 12h, GOON_SCHED_THUMB_DEDUP_HOURS, 0=off) so it stays clean.

Shared logic lives in app/scheduler/thumb_dedup.py (run_thumb_asset_dedup); the one-shot
script now imports it. Same tight signature as the cleanup: family hosts only + identical
duration (the bare asset-id number is reused across unrelated CDNs, so cross-host/diff-
duration grouping is excluded). Reports 205b17d9 / 5a2944cb.
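
A rough sketch of the grouping signature described above, not the code in app/scheduler/thumb_dedup.py. It assumes the thumbnail host and asset id have already been parsed out of each thumbnail URL; the row layout and function name are hypothetical.

```python
from collections import defaultdict


def group_thumb_dupes(rows, family_hosts):
    """Group scene ids that share (asset_id, duration) on family hosts.

    rows: iterable of (scene_id, thumb_host, asset_id, duration_sec).
    Rows from other hosts are ignored because the bare asset-id number
    is reused across unrelated CDNs.
    """
    groups = defaultdict(list)
    for scene_id, host, asset_id, duration in rows:
        if host not in family_hosts or duration is None:
            continue
        groups[(asset_id, duration)].append(scene_id)
    # Only groups with at least two scenes are merge candidates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```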

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-14 14:56:45 +02:00
jtrzupek
81d617efc2 fix(extractors): 4k69 direct okcdn extraction (replaces WebView fallback)
Reverse-engineered the migrated 4k69 player: jwplayer now serves OK.ru CDN (okcdn.ru)
mp4s. The static page (SSR behind Cloudflare, fetched via proxy) carries "file"+"label"
pairs for every quality. okcdn's srcIp param is NOT enforced (cross-IP test 2026-06-14:
206 video/mp4 from a residential IP != srcIp), so the URL plays from any IP. Parse the
okcdn sources server-side and return them mobile_direct_ok — the phone plays the direct
video, no WebView, no VAST preroll, no age-gate, zero VPS proxy. Skips 4K/2K. Reverts
the brief _vps_blocked_fallback routing (WebView grabbed the preroll ad, not content).
Verified on emulator: native player streams the actual scene (report 5de3fbc5).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-14 11:39:36 +02:00
jtrzupek
29da1fbaa6 fix(extractors): route 4k69 to WebView fallback after player migration
4k69 swapped its player from get_file (4kporno.xxx) to jwplayer + okcdn.ru, whose token
carries srcIp= (IP-bound); the site is also behind Cloudflare (VPS fetch only via proxy).
The native get_file extractor matched nothing and returned None, surfacing as a "host
problem" error even though the video plays fine (report 5de3fbc5). Switch 4k69com to
_vps_blocked_fallback: the on-device WebView (residential IP) clears Cloudflare, the
okcdn token binds to the phone IP, and INJECTED_JS hands the jwplayer source to ExoPlayer.
fourk69.extract stays in the module in case the site reverts to get_file.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-14 11:17:18 +02:00
jtrzupek
e512665d26 feat(scenes): force-refresh thumbnail via enrich-thumbnail ?force
enrich-thumbnail was fill-only (skipped scenes that already had a thumbnail), so a
broken or stale preview (rotting sxyprn/trafficdeposit) could not be refreshed. Add a
force flag that re-fetches the source page and overwrites the existing thumbnail.
Backs the new "Refresh thumbnail" button (report d3376a71).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 19:04:10 +02:00
jtrzupek
32919d6a6c feat(extractors): detect deleted porntrex videos and mark dead
Porntrex soft-deletes: a removed video returns HTTP 200 with a "this video was deleted"
message instead of a player, so extract returned [] (transient) and the source was never
marked dead, leaving users on a permanently broken link (report 75dbf53e). Match the
deletion message and raise HosterDead so resolve marks the source dead.
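
In outline, the soft-delete check might look like the sketch below. HosterDead is redefined locally so the example runs on its own, the marker string is taken from the wording above and may not match the site's exact text, and the <source> regex is a stand-in for the real player parsing.

```python
import re


class HosterDead(Exception):
    """Local stand-in for the project's permanent-failure exception."""


DELETED_MARKER = "this video was deleted"  # assumption: exact page wording may differ


def extract(html):
    """Return stream URLs, or raise HosterDead for a soft-deleted video."""
    if DELETED_MARKER in html.lower():
        # HTTP 200 with a deletion notice is permanent, not transient.
        raise HosterDead("video deleted")
    # An empty list still means "transient, try again later".
    return re.findall(r'<source[^>]+src="([^"]+)"', html)
```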

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 19:04:10 +02:00
jtrzupek
9d4384cef3 fix(ingest): cap code/director to column length (GOON-J)
Some sources (sexlikereal) build a giant `code`/`director` from a multi-performer
compilation title, overflowing scenes.code varchar(128) -> StringDataRightTruncation,
and the scene silently dropped from ingest. Cap both at the column limit in
_create_canonical and the fill path; code/director are stored metadata, not match keys,
so truncation is safe.
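
The cap itself is a one-liner; a sketch, with a hypothetical helper name. The 128 limit for scenes.code comes from the message above; the director column length is not stated, so it is left as a parameter.

```python
CODE_MAX_LEN = 128  # scenes.code is varchar(128)


def cap_text(value, max_len):
    """Truncate value to max_len characters; None passes through."""
    if value is None:
        return None
    # varchar(n) limits characters, and Python slices by code point,
    # so the result always fits the column.
    return value[:max_len]
```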

Fixes GOON-J

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 19:04:10 +02:00
jtrzupek
d1f2f035b0 feat(bug-reports): two-way replies (device-scoped) + admin reply endpoint
Reports were anonymous and one-way. Tie each report to the submitting device
(X-Device-Id), add an admin response back-channel, and let the app fetch replies for
its own device:
- migration 0023: bug_reports gains device_id, response, responded_at, response_seen.
- create_bug_report captures device_id.
- GET /bug-reports/mine (device-scoped) returns this device's reports + unseen count.
- POST /bug-reports/mine/seen clears the unseen flag.
- POST /bug-reports/{id}/reply sets the admin response (authored during triage).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 11:35:44 +02:00
jtrzupek
1654d78d59 fix(ingest): strip NUL bytes from raw payloads before Postgres write
A source (TPDB) returned a performer alias containing a literal U+0000 ("Ramon..").
Postgres cannot store U+0000 in JSONB or text, so the external_records JSONB insert in
_upsert_external_record failed with UntranslatableCharacter and the scene never ingested
(GOON-Z). Recursively strip NUL from the raw payload (-> external_records.raw) and, when
present, also re-validate the RawScene/RawMovie so normalize -> typed text columns get
clean data too. Gated by a cheap _has_nul scan so clean records (the overwhelming
majority) pay no extra cost.
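
A sketch of the gated recursive strip described above. The function names mirror the _has_nul idea but are otherwise hypothetical, and the payload is assumed to be plain JSON-like data (dicts, lists, strings, scalars).

```python
def has_nul(value):
    """Cheap scan: does any string in the payload contain U+0000?"""
    if isinstance(value, str):
        return "\x00" in value
    if isinstance(value, dict):
        return any(has_nul(k) or has_nul(v) for k, v in value.items())
    if isinstance(value, list):
        return any(has_nul(v) for v in value)
    return False


def strip_nul(value):
    """Return a copy of the payload with U+0000 removed from every string."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, dict):
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_nul(v) for v in value]
    return value


def clean_payload(raw):
    # Clean records are returned untouched, so they skip the rebuild.
    return strip_nul(raw) if has_nul(raw) else raw
```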

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 19:48:22 +02:00
jtrzupek
aa05ce2647 feat(playback): direct-HLS manifest passthrough + proxy stream drop handling
Time-bound HLS hosters whose manifest URL lacks a .m3u8 extension (e.g. pornhat's
"...mp4,?..." path) were mis-detected by ExoPlayer as progressive MP4 and failed,
forcing a full proxy fallback that streamed the whole video through the server. Serve
such manifests via /proxy/hls/<token>/play.m3u8 with child URLs left absolute on the
CDN, so the device fetches variant+segments directly and only the ~1KB manifest is
proxied. Routed only for mobile_direct_ok (time-bound) HLS without a .m3u8 path.

Also swallow httpx.TransportError in the stream proxy body generator: an upstream CDN
closing the connection mid-stream is benign (client just retries a range) and should
not surface as an unhandled error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 16:14:25 +02:00
jtrzupek
956a0feb22 docs: correct Bright Data proxy type (ISP, flat-rate not per-GB)
It is an ISP proxy (static ISP IPs, flat billing), not residential —
so HTML-ingest bandwidth is free and the full deep-crawl is fine.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:18:40 +02:00
jtrzupek
21bc8bf1fe feat(superporn): browse scraper via Bright Data residential proxy
superporn hard-blocks the VPS IP with Cloudflare 403 on every TLS
impersonation, so HTML ingest routes through Bright Data residential
(BRIGHTDATA_PROXY_URL, parsed in config). First scraper to use a proxy:
optional _proxy on the browse base, threaded into browser_get.

JSON-LD VideoObject (title/desc/uploadDate/thumb/duration) + pornstar
and category chips; superporn double-encodes HTML entities so titles
are unescaped twice. Thumbnails fetch fine from the VPS (no proxy).
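
The double unescape can be sketched with the standard library; the helper name is hypothetical. One caveat under this approach: a title that legitimately contains a literal "&amp;" would also be decoded.

```python
import html


def clean_title(raw):
    """Undo double-encoded HTML entities, e.g. '&amp;amp;' -> '&'."""
    return html.unescape(html.unescape(raw)).strip()
```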

Playback stays off-proxy: the <source> mp4 token is IP-bound to the
fetcher, so resolve is phone-side via WebView (extractor superporncom
-> _vps_blocked_fallback), same as porndoe.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:47:45 +02:00
jtrzupek
80fd83cb4e feat(tubes): add 4k69 + neporn browse scrapers, shared PlayTube base
4k69.com (~65k scenes): same PlayTube CMS as hqfap - common logic moved
to _playtube.py (sitemap catalog, JSON-LD, pills). Studio classified by
matching category pills against the studios index page. Streams are
get_file (fullmovies family) returned unresolved with mobile_direct,
2160p skipped.

neporn.com: KVS engine, latest-updates listing, JSON-LD + video:duration
meta, performers from models links with flashvars video_tags fallback
for fresh uploads. Resolve via _kvs; final URL portable cross-IP.

superporn.com rejected: Cloudflare 403 from VPS on all TLS impersonations.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:15:13 +02:00
jtrzupek
6de986b9a7 feat(hqfap): browse scraper + native mp4 extractor (~120k scenes)
PlayTube CMS. Sitemap-based pagination (listing has no GET paging),
JSON-LD VideoObject metadata, pornstar/category pills, " Clips"
categories mapped to studio. Direct mp4 (cdnde.com/okcdn.ru), tokens
time-bound and portable cross-IP, so mobile plays direct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:51:04 +02:00
jtrzupek
08079787da feat(sxyprn): on-demand thumbnail resolver (live posters, ~1h-TTL workaround)
trafficdeposit poster tokens live ~1h (hour-bucketed), so stored URLs can't persist.
New GET /proxy/sxyprn-thumb/{post_id}: resolves the current og:image from the live
/post/<id> page (cache resolved poster URL ~40min), streams bytes with Referer +
long client Cache-Control (URL is stable per post_id → client disk-caches the image,
backend fetches each post ~once). Deleted posts ("Post Not Found") → 404.

Scene grid now emits /proxy/sxyprn-thumb/<id> for sxyprn sources (derived from
page_url) instead of the dead stored trafficdeposit URL. Verified: live post → 200
image, deleted → 404, grid emits resolver URL. Backend-only, no OTA.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 15:02:49 +02:00
jtrzupek
f7670963df fix(sxyprn): disable thumbnail refresh job — trafficdeposit token has ~1h TTL
CORRECTION: trafficdeposit thumbnail tokens are hour-bucketed and valid only ~1h
(verified 2026-06-10: stored ts=11:00 dead at 12:27, current ts=13:00 loads). Earlier
"~weekly rot" read was wrong. Storing/periodically-refreshing sxyprn thumbnail URLs
is futile — they expire within the hour. Default the refresh job OFF (kept in code).
The dead-marking sweep (Post Not Found → dead_at) it performed was still valid. Live
sxyprn thumbnails need on-demand resolution at serve time (future work).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 14:29:24 +02:00
jtrzupek
fef28ae56b feat(sxyprn): refresh rotting thumbnails from live post pages + scheduled job
CORRECTION to earlier "unrecoverable" call: the /post/<id> page is alive (200) and
DOES expose the scene's own fresh-signed poster via og:image / <video poster>
(post-id embedded, current timestamp) — only the STORED thumbnail URL had rotted.
Search/listings don't re-surface old posts (0 overlap), but per-post fetch works.

scripts/refresh_sxyprn_thumbs.py: iterate live sxyprn sources, fetch post page,
extract fresh og:image, UPDATE thumbnail_url (verified: refreshed URLs return 200).
_job_refresh_sxyprn_thumbs: every 12h refresh the 1200 least-recently-updated sources
(cycles the ~19k catalog within the expiry window). Pairs with the scene_resolver
overwrite fix so refreshed thumbnails stick.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 10:36:30 +02:00
jtrzupek
bb9e1afc31 fix(resolver): refresh thumbnails on re-scrape instead of fill-only-if-null
_upsert_playback_sources only set thumbnail_url when the existing value was NULL,
so signed CDN thumbnails that ROT (sxyprn/trafficdeposit tokens expire ~weekly →
404) were never replaced even when a fresh re-scrape captured a valid URL — making
the rot permanent (bug 2026-06-10). Always overwrite thumbnail_url/animated_thumbnail_url
with the freshly-scraped value when present; other fields keep fill-if-null. Lets
the regular performer-driven ingest self-heal thumbnails for re-crawled scenes.

(Note: old sxyprn backlog can't be bulk-refreshed — search/listings don't re-surface
those posts, verified 0 overlap — so it's forward-looking; old sxyprn-only scenes
fall back to the clean placeholder.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 10:28:18 +02:00
jtrzupek
adbdce1c75 fix(api): de-prioritize rotting sxyprn/trafficdeposit thumbnails
sxyprn thumbnails are time-signed on trafficdeposit CDN and ROT — the signed asset
404s after ~weeks and can't be re-signed/refreshed server-side (bug 2026-06-10,
~15k sxyprn-only scenes showed broken thumbs). In the light-list slim-thumbnail pick,
prefer a thumbnail from any non-trafficdeposit source; fall back to sxyprn only when
it's the scene's sole thumbnail (recent ones still load; dead ones now render a clean
placeholder client-side instead of a broken image).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 10:11:10 +02:00
jtrzupek
c8baa11604 feat(api): device-scope user state (favorites/progress/blacklists)
Public instance has no accounts, so all user state was GLOBAL in DB — new users
saw/overwrote each other's (and Jan's) favorites, watched badges and blacklists
(bug 2026-06-10). Add device_id (VARCHAR 64) to 9 state tables with composite PK
(device_id, entity_id); app sends X-Device-Id header (get_device_id dep). All
favorites/scene-favorites/blacklist/watch + scene&movie list/detail (is_favorite,
watched, blacklist-hide) now filter by device. Existing rows backfilled to
'legacy-shared'; POST /me/adopt-legacy reassigns them to the caller once. Old
clients (no header) map to legacy-shared so they keep working until OTA updates.

Migration 0022: add col, backfill, composite PK. Verified on prod: 967 progress
rows preserved, device isolation holds (new device sees none of legacy state).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:58:01 +02:00
jtrzupek
e1c7efb947 chore(api): drop unused has_animated_thumbnail scene filter
The hold-to-preview gesture is being removed (did nothing useful), and no client
sends this filter. Remove the Query param, its EXISTS filter, and the pure-default
count guard reference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 09:52:15 +02:00
jtrzupek
e98ef6577e feat(api): scene hide + merge-duplicate endpoints for long-press actions
POST /scenes/{id}/hide — marks all playback_sources dead so the scene drops out
of has_playback lists (reversible via dead_at; row kept for dedup/refs).
POST /scenes/{keep_id}/merge/{drop_id} — merges drop into keep via scene_merge
(moves refs/performers/tags/fingerprints/playback). Backs the new tile long-press
menu (hide / mark-duplicate) replacing the dead animated-preview gesture.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 09:47:16 +02:00
jtrzupek
abddd27856 fix(proxy): stable image-proxy URLs so expo-image actually caches thumbnails
make_token embedded the current timestamp in the expiry, so every /scenes fetch
produced a DIFFERENT proxied URL for the same thumbnail → expo-image (keyed by URI)
cache-missed and re-downloaded every list load / app launch. Add stable_bucket_sec:
quantize the expiry base to a window so the URL is identical across requests.
_wrap_image_proxy uses a 7-day bucket → thumbnails disk-cache for a week instead of
re-fetching constantly. Answers "are thumbnails cached?": now yes.
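
To illustrate the quantization, here is a sketch of a signed token whose expiry is derived from a bucketed timestamp. Only the stable_bucket_sec idea comes from the commit; the token layout, the HMAC scheme and the parameter names are assumptions, not the project's make_token.

```python
import hashlib
import hmac


def make_token(url, secret, now, ttl_sec, stable_bucket_sec=0):
    """Sign url with an expiry; quantize the expiry base when a bucket is set."""
    base = now
    if stable_bucket_sec > 0:
        # Every request inside the same window sees the same base,
        # so the expiry (and therefore the URL) is identical.
        base = now - (now % stable_bucket_sec)
    expires = base + ttl_sec
    sig = hmac.new(secret, f"{expires}:{url}".encode(), hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"
```

With a 7-day bucket, ttl_sec should exceed the bucket length so that a token issued near the end of a window is still valid when used.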

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 09:45:22 +02:00
jtrzupek
3e8a221981 feat(extractors): native HLS for xhamster; hqporner flyflv player
xhamster: move from WebView fallback to server-side native HLS. The scene page
is fetchable server-side and the xhcdn master m3u8 (variants + segments) is
time-bound, not IP-bound (verified cross-IP), so mobile plays the HLS direct
with zero proxy bandwidth. New tubes/xhamster.py pulls the master m3u8 from
SSR HTML and returns type='m3u8' mobile_direct; registry remaps xhamstercom
off _vps_blocked_fallback.

hqporner: add flyflv to the player-iframe host whitelist. hqporner rotated
some players to flyflv.com; the CDN host was already whitelisted but the iframe
host was not, so those scenes returned no stream.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 09:35:58 +02:00
jtrzupek
ffb80c7b60 feat(performer): replace dev Re-scrape button with top-tag chips
bug-report 1a4bf258: "Re-scrape could go away; tags/categories could be there instead".
Re-scrape was a dev-only bulk thumbnail/tag enrich — noise on the performer page
(per-scene enrich already happens on SceneDetail). Removed it; kept Search.

New GET /performers/{id}/tags aggregates scene_tags across the performer's
live-playback scenes (top N). PerformerScenes renders them as chips → tap navigates
to TagScenes. Search button widened to full row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 11:56:26 +02:00
jtrzupek
f8b1e801ef fix(api): collapse same-origin playback sources on scene detail
A merged scene often aggregates several uploads from ONE tube (re-encodes / 4K
dups). bug-report aa79a995 "why 2 links, both porntrex?" = same scene std + 4K
(porntrex 2591377 + 2593449 "...in 4K"). In the UI these are indistinguishable
links to one hoster (same extractor). Keep one best per origin: prefer duration
matching the scene → any duration → first (origin-asc stable). Dead already filtered.
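
The preference order above can be sketched as a small ranking pass. The dict keys and function name are hypothetical, and "duration matching the scene" is assumed to mean exact equality here; the real check may allow a tolerance.

```python
def collapse_same_origin(sources, scene_duration):
    """Keep one playback source per origin, best first by the rule above."""

    def rank(src):
        d = src.get("duration_sec")
        if d is not None and scene_duration is not None and d == scene_duration:
            return 0  # duration matches the scene
        return 1 if d is not None else 2  # any duration, else no duration

    best = {}
    for src in sources:
        current = best.get(src["origin"])
        # Strict '<' keeps the first source on ties, so the result is stable.
        if current is None or rank(src) < rank(current):
            best[src["origin"]] = src
    return [best[origin] for origin in sorted(best)]
```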

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 11:50:45 +02:00
jtrzupek
65b9df073a fix(extractors): route sxylandcom through _embed_iframe, not webview fallback
Chrome-DevTools investigation of bug-report 827a50a1 (sxyland "long loading,
then webview, no autoplay") showed sxyland embeds playmogo.com/e/<id> — a
DoodStream clone (doodcdn.io infra, pass_md5 protocol, get_slides) behind an
INVISIBLE Cloudflare Turnstile (not an interactive CAPTCHA; auto-passes in a
real browser/WebView from a residential IP). The sxyland page itself is NOT
Turnstile-gated — VPS curl pulls the playmogo iframe URL straight from the HTML.

sxylandcom was wired to _vps_blocked_fallback → phone loaded the entire sxyland
page in WebView (ads, click-to-play, no autoplay = the reported symptom), and the
playmogo embed never reached the phone's dood resolver. _embed_iframe (which
already lists sxyland in its docstring) extracts the playmogo embed and emits it
as type='hoster' → PlayerScreen routes playmogo URLs to doodstream.ts (resolveDoodStream),
which resolves phone-side (phone IP passes invisible Turnstile) → direct mp4 → autoplay.

Mobile unchanged (hoster→dood path already exists for xmoviesforyou/siska). Backend-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 11:41:38 +02:00
jtrzupek
e23e2d1f17 fix(merge): move playback_sources on scene merge + exact-title+duration dedup
merge_scenes never reassigned playback_sources → ON DELETE CASCADE dropped them
with the absorbed scene. Cross-source (canonical) merges rarely had tube playback
so it hid, but tube-dup merges silently LOST playback links. Add _move_playback_sources
(global unique (origin,page_url) guarantees no collision on reassign).

+ merge_exact_title_duration.py: catches missing-merge dupes bulk_dedup misses
(same performer + identical normalized title + identical duration_sec, no phash).
Bad Bella had 25 such pairs (bug-report ef92809d "duplicate, same thumbnails").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 10:56:50 +02:00
jtrzupek
7bf1fd6716 fix(xvideos): parse model name from nested span.name — recover 0-performer scenes
xvideos renders the scene's models as `<a href="/models/slug">...<span class="name">
Display Name</span>...`. The old _MODEL_RE wanted text immediately after the anchor
`>` and never matched current markup → browse-scraped scenes landed with 0 performers
(bug-report 2026-06-07: "no actors, but Rebecca Johnson is on the page"). New regex
captures slug + nested span.name, bounded within the anchor. + backfill script for the
~11.9k existing zero-performer xvideos scenes (54% have a real /models/ link; resolver
merges names to canonical by name_normalized).
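
A regex along these lines would capture slug plus nested span.name while staying inside the anchor. This is a sketch written against the markup quoted above, not the project's actual _MODEL_RE; the sample HTML in the usage note is invented.

```python
import re

MODEL_RE = re.compile(
    r'<a\s[^>]*href="/models/(?P<slug>[^"/]+)/?"[^>]*>'
    # Tempered dot: consume anything except the closing </a>, so the
    # span must belong to this anchor.
    r'(?:(?!</a>).)*?<span class="name">\s*(?P<name>[^<]+?)\s*</span>',
    re.DOTALL,
)


def parse_models(html):
    """Return (slug, display name) pairs for every model link."""
    return [(m.group("slug"), m.group("name")) for m in MODEL_RE.finditer(html)]
```

For instance, an anchor wrapping an image and a span.name yields one pair, while a span.name that sits outside any /models/ anchor is ignored.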

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 10:13:21 +02:00
jtrzupek
2b602beea5 fix(dedup): tighten cross-source candidate prefilter — kill 1800s hang (GOON-V)
_candidate used OR logic (studio OR date±7d OR dur±30s) → 938,950 pairs;
Stage-2 scoring at ~110/s never finished in 1800s → bulk_dedup_performers HUNG
every run, orphan thread leaked until restart. Require AND: same studio plus
(date±2d OR dur±30s). 939k→16k pairs, full run 213s. Real cross-source dup of
one master shares studio + near date/duration; rare studio_id-mismatch pairs
skipped on purpose — a job that COMPLETES beats one that times out merging nothing.
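
The tightened prefilter amounts to the predicate below. It is a sketch: the Scene dataclass and field names are hypothetical, while the AND structure and the ±2d / ±30s windows are taken from the message above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Scene:
    studio_id: Optional[int]
    released: Optional[date]
    duration_sec: Optional[int]


def is_candidate(a, b):
    """Same studio AND (release date within 2 days OR duration within 30 s)."""
    if a.studio_id is None or a.studio_id != b.studio_id:
        return False  # studio mismatches are skipped on purpose
    near_date = (
        a.released is not None and b.released is not None
        and abs((a.released - b.released).days) <= 2
    )
    near_duration = (
        a.duration_sec is not None and b.duration_sec is not None
        and abs(a.duration_sec - b.duration_sec) <= 30
    )
    return near_date or near_duration
```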

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 10:03:33 +02:00
jtrzupek
cd257740be fix(hqporner): require ALL query tokens in slug — stop performer over-attribution
hqporner search post-filter kept a scene if its slug contained ANY query token
(>=3 chars). For multi-word performer names this matched on a single common token
(e.g. "anna","mia"), so the performer-driven ingest attributed the scene to EVERY
performer sharing that token — scenes accumulated up to 503 wrong performers
(hqporner = 5659 of 5897 scenes with >30 performers; bug-reports 2026-06-07).

Switch ANY->ALL: the slug must contain every query token, requiring a full name
match before attribution. Single-word names still work. Precision over recall —
144 wrong performers is far worse than missing a few loose matches.
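
A sketch of the ALL-token rule, assuming tokens are compared as whole lowercase words of the slug; the real filter may match substrings and may still apply the >=3-character minimum. Names in the usage note are placeholders.

```python
import re


def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def slug_matches(query, slug):
    """True only if every query token appears in the slug."""
    wanted = _tokens(query)
    # An empty query never matches, so nothing gets attributed by accident.
    return bool(wanted) and set(_tokens(slug)).issuperset(wanted)
```

Under this rule slug_matches("jane doe", "jane-smith-scene") is False, whereas the old ANY rule would have accepted it on "jane" alone.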

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 09:28:18 +02:00
jtrzupek
43f7e1f7b2 perf(scenes): literal tag_id in filter — 4-12s tag lists -> ~20ms
Tag-filtered scene lists (e.g. blowjob + has_playback) took 4-12s. Root cause:
the filter joined scene_tags->tags on slug, so the actual tag_id was opaque to
the planner at plan time. It fell back to average per-tag cardinality
(8.4M/11541 ≈ 726) instead of the real 273k, chose to materialize ALL matching
scene_tags + check playback per row, then top-N sort.

Fix: resolve slug->tag_id in the app and filter on a LITERAL tag_id (no slug
join). With a constant, the planner uses MCV stats, knows the tag is huge, and
walks ix_scenes_created_at_desc probing scene_tags/playback per scene, stopping
at the page limit. Verified: blowjob list 3300ms -> 18ms (EXPLAIN), HTTP 4-12s ->
47ms. Unknown slug short-circuits to empty. (Pairs with the raised tag_id
statistics target so mid-tier tags also get correct estimates.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 21:10:31 +02:00
jtrzupek
d52641774d perf(scenes): light list payload — drop tags/refs, slim playback to thumbnail
Scene list returned the full SceneOut per item (nested tags/external_refs + all
playback_sources with page_url/embed/stream/quality) though SceneTile only reads
the thumbnail + title/duration/performer/studio, and SceneDetail re-fetches the
full scene via /scenes/{id}. Added light=True to _build_scenes_out_batch: skip the
tags + external_refs queries entirely and collapse playback_sources to one slim
entry (thumbnail_url + animated_thumbnail_url only).

Result: default list payload 78KB->48KB (-38%), ~28ms cached, less DB work per
list. Verified on emulator: grid thumbnails/durations/titles render unchanged.
No mobile change (tile reads the same fields); server-side, no OTA.

NOTE: the separate slow path — common-tag-filtered lists (4-12s; query expands all
matching scene_tags before sort/limit) — is structural (needs a denormalized
(tag_id, created_at) index) and deferred. VACUUM ANALYZE + raised tag_id stats
applied but the planner still can't avoid the materialization.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 21:03:26 +02:00
jtrzupek
4922646011 feat(dedup): merge exact-phash + same-duration + shared-performer duplicates
bug-report 2026-06-03 ("same duration, same thumbnail, why don't they merge"):
duplicate scenes not merged at ingest. Exact phash alone is noisy here (95% are
collisions on shared thumbnails/intro frames — different scenes; bulk_dedup scorer
correctly gives 0 auto-merge). The safe subset is exact-phash AND same duration
(±3s) AND shared performer/title — near-certain same scene. Same-duration is key:
it excludes the false-merge pattern (short-clip-vs-full has DIFFERING durations).

- scripts/merge_phash_exact_dupes.py: one-off, dry-run by default, per-pair re-fetch
  (handles clusters). Applied: 30 merged.
- bulk_dedup: add `_pairs_exact_phash` (SQL O(N log N), not the O(N²) Hamming scan)
  + strategy "phash_exact" — gated by the normal scorer (surfaces review candidates,
  no risky auto-merge), schedulable for ongoing exact-collision review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 20:08:06 +02:00
jtrzupek
a196fcbcdb refactor(ingest): rename scraper Source name "pornapp" -> "tube-scraper"
The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp" — a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service; it is
not — these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs +
external_records history stays linked. No behavior change — external_id keying
(sitetag:url) and dedup are unaffected.

NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.

Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 16:54:55 +02:00
jtrzupek
8c0edbdf7b fix(playback): mark deleted sxyprn posts dead + rank native sources first
Two bug-report fixes (2026-06-07):
- sxyprn returns HTTP 200 "Post Not Found" for deleted posts (soft-404), so the
  extractor returned None → resolve treated it as transient and never marked the
  source dead, leaving a dead link offered forever. Now raise HosterDead on the
  marker so resolve marks it dead.
- Scene playback sources were ordered alphabetically by origin, so a WebView-
  fallback hoster (fpoxxx, IP-bound + ad-heavy) ranked above a working native
  source (freshporno) on the same scene. Add is_vps_blocked_fallback() and sort
  native-resolve origins ahead of WebView-fallback ones.

Verified on prod: sxyprn dead URL → HosterDead; scene sources reorder
freshpornoorg before fpoxxx.
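
The reordering can be expressed as a two-part sort key. A sketch: the source dicts are hypothetical, and the fallback predicate is passed in rather than reproducing the project's is_vps_blocked_fallback().

```python
def order_sources(sources, is_fallback):
    """Native-resolve origins first, WebView-fallback origins last.

    False sorts before True, and the origin name keeps the previous
    alphabetical order inside each group.
    """
    return sorted(sources, key=lambda s: (is_fallback(s["origin"]), s["origin"]))
```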

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 14:09:01 +02:00
jtrzupek
9d0cb7f26e fix(scheduler): bulk_dedup performers cross_source_only + hard-timeout (OOM)
_job_bulk_dedup_performers called run_bulk_dedup(strategy="performers") without
the cross_source_only guard whose docstring exists precisely to prevent this OOM.
At current catalog scale the unguarded path materializes N²/2 pairs per prolific
performer into a list → worker hit 6GB RSS and was OOM-killed every 12h (05:00/
17:00), taking down concurrent tpdb/stashdb/movie ingests as killed_by_restart
(0 new movies). Verified in prod: 05:00 run now completes (885k pairs scored, no
OOM) and ingests succeed (stashdb +241, tpdb +175).

Also wrap in _run_with_timeout like tpdb/stashdb (job had no hard-timeout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 11:00:19 +02:00
jtrzupek
fad72e9cd6 fix(tags): merge <base>2 numbered-duplicate tags + prevent regeneration
TPDB taxonomy emits numbered-duplicate tags (name "Bubble Butt2"); slugify
yields "bubble-butt2" (no separator before digit), so resolve_tag created a
separate tag alongside "bubble-butt". Tube scenes inherited the dup via
scene-merge → 75 pairs, ~10k scene_tags on the wrong tag.

- resolve_tag: canonicalize "<base>2" -> "<base>" when base exists (handles
  current + future; trailing-"2"+alpha guard leaves milf-30/teen18 intact)
- scripts/merge_dup2_tags.py: one-off bulk merge (scene_tags + movie_tags +
  blacklist) and taxonomy-count refresh
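
One possible reading of the canonicalization rule, as a sketch: a trailing "2" is dropped only when the character before it is a letter and the base slug already exists. The function name and the set-based lookup are assumptions; resolve_tag works against the tags table.

```python
def canonical_slug(slug, existing):
    """Map '<base>2' to '<base>' when that base tag already exists."""
    if len(slug) >= 2 and slug.endswith("2") and slug[-2].isalpha():
        base = slug[:-1]
        if base in existing:
            return base
    # Slugs ending in other digits, or with a digit before the "2", are kept.
    return slug
```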

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 23:18:44 +02:00
jtrzupek
210aec0536 feat(scrapers): extract tags + description from porndish scene pages
porndish-only scenes had no tags and no description — the scraper only derived a
title from the URL slug. The scene page (g1/bimber WP theme) carries both: a
<p class="entry-tags"> list of /video2/<slug>/ links (the "#" tags the user sees,
categories + co-performers) and a prose description <p> in .entry-content.

Override _fetch_scene_metadata in PornDishScraper to pull both from one page
fetch. Extend the base hook to accept an optional 4th return element
(description) and thread it into RawScene.description — backward compatible with
the existing 3-tuple (pornhat). Strips leading embed-button labels
("Video Player N", "Server N") from the prose. Verified on live scenes: clean
tag lists + real descriptions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 21:32:10 +02:00
jtrzupek
83918e9a8d perf(movies+scenes): direct-play #hash movie hosters; skip empty blacklist filters
Movies: the seekplayer-engine family (easyvidplayer/player4me/seekplayer/
embedseek/upns, ~322k sources) returns a time-bound master.m3u8 on a CDN with a
valid IP-SAN cert that plays cross-IP. Mark it mobile_direct in resolve, and make
MovieDetailScreen prefer direct_url with a proxy fallback (mirrors the scene
path) — previously every movie streamed through the VPS proxy. Paradisehill
multipart parts now go direct too. Device-verified: ExoPlayer plays the raw CDN
direct, zero proxy traffic, no flicker.

Scenes: the three blacklist NOT EXISTS clauses were appended to every filtered
list and evaluated per-row even when all blacklist tables are empty (~3.4s tax on
a deep mega-tag walk). Skip them when the tables are empty (cached check) —
mega-tag list 6.7s -> 3.3s, and every filtered list benefits.
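
Sketch of the cached emptiness check, using sqlite3 only so it is runnable. The table names, the column names inside the clauses, the module-level cache and the TTL are all assumptions; the commit only says the three NOT EXISTS clauses are skipped when the blacklist tables are empty.

```python
import sqlite3
import time

# Hypothetical table names and clauses; the real schema is not shown here.
_BLACKLIST_CLAUSES = {
    "blacklisted_tags": "NOT EXISTS (SELECT 1 FROM blacklisted_tags b WHERE b.tag_id = st.tag_id)",
    "blacklisted_performers": "NOT EXISTS (SELECT 1 FROM blacklisted_performers b WHERE b.performer_id = sp.performer_id)",
    "blacklisted_studios": "NOT EXISTS (SELECT 1 FROM blacklisted_studios b WHERE b.studio_id = s.studio_id)",
}
_cache = {"checked_at": None, "empty": False}


def blacklists_empty(conn: sqlite3.Connection, ttl_sec: float = 60.0) -> bool:
    """Return True when every blacklist table has zero rows (cached for ttl_sec)."""
    now = time.monotonic()
    checked_at = _cache["checked_at"]
    if checked_at is not None and now - checked_at < ttl_sec:
        return _cache["empty"]
    empty = all(
        conn.execute(f"SELECT 1 FROM {table} LIMIT 1").fetchone() is None
        for table in _BLACKLIST_CLAUSES
    )
    _cache.update(checked_at=now, empty=empty)
    return empty


def blacklist_clauses(conn: sqlite3.Connection, ttl_sec: float = 60.0) -> list:
    """Clauses to append to a filtered list query; empty when nothing is blacklisted."""
    if blacklists_empty(conn, ttl_sec):
        return []
    return list(_BLACKLIST_CLAUSES.values())
```

The trade-off of a TTL cache is that a freshly added blacklist entry can be ignored for up to `ttl_sec`; invalidating the cache on blacklist writes would avoid that.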

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:44:41 +02:00
jtrzupek
e780e1ae6f fix(hdporngg+fullmovies): native get_file, skip broken 4K — "loading forever"
User: "hdporngg loading forever". DevTools + cross-IP investigation (not guessing):
- site is alive (sample scenes 200; the one earlier 404 was a single removed video,
  not the site — my earlier "site dead" was a hasty generalization).
- both are the same platform (<source src=.../get_file/8512/...mp4>), no function/0.
- the get_file 302 is fast (~100ms) but the 2160p/4K source on fpvcdn.com TIMES OUT
  (~30s); 720p/480p resolve in ~1s. The player loading 4K first = the "loading forever".
- the final fpvcdn URL embeds the requester IP (ip=<fetcher>) -> IP-bound to whoever
  resolves it; BUT the get_file itself is stateless (fresh session works) and valid >=90s,
  and binds fpvcdn to the fetcher. So a VPS resolve would bind to the VPS IP (mobile 403),
  but returning the get_file URL UNRESOLVED lets the phone follow the 302 itself ->
  fpvcdn binds to the phone IP -> plays.

Fix: new _source_getfile resolver returns get_file URLs as mobile_direct (skip 4K),
phone resolves the 302 in-session. Native, multi-quality, no WebView, no proxy.
Replaces fullmovies' old force_proxy+4K extractor and the WebView fallback for both.
Backend-verified: resolve -> 720/480 mobile_direct, get_file fresh fetch -> 206. Pending
on-device confirmation (emulator unstable; same mechanism as porn00/freshporno which work).
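
Sketch of the source selection described in the fix (not the actual _source_getfile). The function name, the `(height, url)` input shape and the output dict keys are assumptions; the only rules taken from the commit are "skip 4K" and "return the get_file URL unresolved as mobile_direct".

```python
from typing import Dict, Iterable, List, Tuple, Union

SourceDict = Dict[str, Union[int, str, bool]]


def select_getfile_sources(
    candidates: Iterable[Tuple[int, str]],
    skip_height: int = 2160,
) -> List[SourceDict]:
    """Keep get_file URLs below `skip_height`, best quality first.

    The URLs are returned unresolved so the client follows the 302 itself
    and the CDN token binds to the client's IP, not the backend's.
    """
    kept = [
        {"height": height, "url": url, "mobile_direct": True}
        for height, url in candidates
        if height < skip_height
    ]
    return sorted(kept, key=lambda source: source["height"], reverse=True)
```

Dropping the 2160p entry up front means the player never starts with the source that times out.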

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 22:48:55 +02:00
jtrzupek
c05bafb4c7 fix(porn00): backend KVS resolve (portable CDN, no proxy) — corrects #20
Same proper re-investigation as freshporno (DevTools + Bright Data residential
cross-IP + curl_cffi browser TLS). porn00's final CDN fe.porn00.org/...?token=&expires=
is PORTABLE cross-IP (token resolved from one residential IP replays 206 from a
different Bright Data residential IP) and only rejects non-browser TLS (plain curl
403, curl_cffi chrome 206). In #20 I tested the final URL with a standalone plain
curl, got 403, wrongly concluded "IP-bound" and left it on WebView (and before that
it used force_proxy, which violated the no-proxy stance).

porn00 flashvars are plain get_file (already decoded, no function/0 prefix), so
extend _kvs._URL_RE to match both forms — real_url passes plain URLs through
unchanged, _resolve_get_file follows the 302 in-session. porn00.py becomes a thin
_kvs wrapper. Verified no regression for the function/0 tubes (yespornvip/pornditt/
freshporno still resolve 3x mp4). Result: porn00 native multi-quality, mobile_direct,
zero proxy/WebView.

fpoxxx and pornxp were re-tested the same way and ARE genuinely IP-bound (403 from a
different residential IP — their token binds to the resolver IP), so they correctly
stay on the WebView fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 21:15:19 +02:00
jtrzupek
6e3ad870a7 fix(freshporno): backend KVS resolve (portable CDN) — corrects #20
Re-investigated with the proper method (Chrome DevTools network capture + cross-IP
test via Bright Data residential proxy + curl_cffi browser-TLS) instead of guessing.
freshporno's real flow is get_file -> 302 -> cdn4.freshporno.org/remote_control.php
-> 206 video/mp4. The CDN URL is PORTABLE cross-IP (a token generated from one
residential IP replays fine from the VPS and from a different Bright Data residential
IP), it only rejects non-browser TLS fingerprints (plain curl -> 000, curl_cffi
chrome / ExoPlayer -> 206).

In #20 I tested the final URL with a standalone plain curl, got 000, and wrongly
concluded "unreachable from residential" -> kept it on the WebView fallback, which
barely worked (ad-heavy page, flaky). That false negative is the regression the user
reported. freshporno is function/0 KVS, so _kvs.resolve_kvs (which uses curl_cffi
chrome) already decodes + resolves it to a portable mp4 — switch to backend resolve
like yespornvip/pornditt: native, multi-quality, no proxy, no WebView.

Verified: backend resolve returns 3x mp4 (1080/720/480, mobile_direct) + cdn 206;
user confirmed native playback on device.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 21:12:17 +02:00
jtrzupek
c18ed24330 extractors: register fullmoviesxxx + hdporngg (WebView fallback)
Bug 19866e9e ("problem with both hosters"): a scene whose only two sources were
fullmovies.xxx and hdporn.gg wouldn't play at all — neither had an entry in the
extractor registry, so try_extract returned None ("no stream"). fullmovies.xxx
serves a <source ...get_file...mp4> but the get_file CDN times out from the VPS
(unreachable, like freshporno), so backend resolve isn't viable; hdporn.gg sample
pages 404. Route both through the WebView fallback so the phone (residential IP)
loads the page and plays / the injected-JS scrape can grab the URL — strictly
better than no playback path. Surfaced by the hoster sweep + this bug report.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 22:16:05 +02:00
jtrzupek
e42217773f feat(deep-crawl): xvideos browse source (capped) + per-tube page cap
xvideos server-side-renders a JSON-LD VideoObject (duration/title/uploadDate) plus on-page
/models/ (performers) and /tags/ links. Sample: median ~10.5min, 93% >=3min. Pilot (2 pages):
29 new, 100% playable + visible + tagged (performers sparse: xvideos 'new' is amateur-heavy;
/models/ tagged mostly on studio rips).

- XVideosBrowseScraper (JSON-LD + page-parse models/tags), in ALL_BROWSE_SCRAPERS.
- deep_crawl._PAGE_CAP: per-sitetag depth cap; xvideoscom=1800 (~newest 50k). At the cap
  the tube is marked exhausted (reset -> incremental re-sweep) so a mega-tube cannot
  monopolize the round-robin or balloon the DB.
- ported yesporn.py into the public repo (was prod-only, like hdporngg) ending the
  __init__ public/prod divergence.
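
Sketch of the per-sitetag cap check. Only `_PAGE_CAP` and the xvideoscom=1800 value come from the commit; `is_exhausted` is a hypothetical helper, and "at the cap" is assumed to mean "the next page to crawl would exceed the cap".

```python
from typing import Dict

# Per-sitetag depth cap; tubes without an entry are uncapped.
_PAGE_CAP: Dict[str, int] = {"xvideoscom": 1800}


def is_exhausted(sitetag: str, next_page: int) -> bool:
    """True when `next_page` lies beyond the tube's page cap."""
    cap = _PAGE_CAP.get(sitetag)
    return cap is not None and next_page > cap
```

A caller would check this before fetching and, when it returns True, mark the tube exhausted so the round-robin moves on.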

youporn rejected: its JSON-LD lacks actor/keywords, and its /pornstar/ and /category/ links
are A-Z navigation, not scene-specific. xhamster rejected: 429/Cloudflare from the VPS IP.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 11:16:44 +02:00
jtrzupek
ee4915770f feat(deep-crawl): eporner via JSON API as SSR-rich source (Phase 2b alternative)
porntrex/hqporner rejected for deep-crawl: KVS sites with no SSR metadata (77% of
existing porntrex has no duration -> invisible under the app's >=60 filter). eporner
instead exposes a public JSON API (api/v2/video/search) returning title + length_sec
+ keywords + added per video; ~100k videos, ~100/page, no per-scene detail fetch.

- BaseBrowseScraper.crawl_page(page): factored out of latest_scenes; returns None
  (transient fail) / [] (catalog end) / [scenes]. API subclasses override it.
- deep_crawl drives via crawl_page (supports HTML-listing AND API sources).
- EpornerApiScraper: crawl_page hits the eporner API -> RawScene with duration+tags+
  date+thumb+playback; registered in ALL_BROWSE_SCRAPERS.
- Pilot (2 API pages): 192 new, 100% playable + tagged + visible (>=60); the <180s
  trailer filter dropped 6 short clips.
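
Sketch of how a driver can consume the three-way crawl_page contract (None / [] / [scenes]). `drive_crawl`, its page budget and the status strings are hypothetical; the real deep_crawl loop is not shown in this log.

```python
from typing import Callable, List, Optional, Tuple


def drive_crawl(
    crawl_page: Callable[[int], Optional[list]],
    start_page: int = 1,
    max_pages: int = 10,
) -> Tuple[List, int, str]:
    """Walk pages until a transient failure, the catalog end, or the budget.

    Returns (scenes, next_page, status), where status is one of
    "retry" (crawl_page returned None), "exhausted" (returned []),
    or "budget" (max_pages consumed).
    """
    scenes: List = []
    page = start_page
    for _ in range(max_pages):
        batch = crawl_page(page)
        if batch is None:
            # Transient failure: keep the cursor so the same page is retried.
            return scenes, page, "retry"
        if not batch:
            # Empty list: the catalog has no more pages.
            return scenes, page, "exhausted"
        scenes.extend(batch)
        page += 1
    return scenes, page, "budget"
```

Keeping None and [] distinct is the point of the contract: a network blip must not be mistaken for the end of the catalog.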

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:37:20 +02:00
jtrzupek
0f19a61789 feat(ingest): skip <180s tube scenes (trailers) + purge porndoe trailer orphans
Deep-crawling tube catalogs pulls in lots of <3min trailers/teasers (porndoe). Add
min_ingest_duration_sec (default 180): _process_scene skips scraper-source scenes whose
known duration is below the floor (unknown duration kept; canonical TPDB/StashDB
untouched). Deleted 67 existing porndoe-only orphan trailers (<180s, no canonical, no
non-porndoe live playback) via cascade.
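
Sketch of the skip rule as described (not the actual _process_scene). The function name and the `is_scraper_source` flag are assumptions; the 180s default and the "unknown duration kept, canonical untouched" behaviour come from the commit.

```python
from typing import Optional

MIN_INGEST_DURATION_SEC = 180  # default floor named in the commit


def should_skip_scene(
    duration_sec: Optional[int],
    is_scraper_source: bool,
    floor_sec: int = MIN_INGEST_DURATION_SEC,
) -> bool:
    """True when a scraper-sourced scene is known to be shorter than the floor."""
    if not is_scraper_source:
        # Canonical (TPDB/StashDB) scenes are never filtered here.
        return False
    if duration_sec is None:
        # Unknown duration: keep the scene rather than guess.
        return False
    return duration_sec < floor_sec
```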

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:11:25 +02:00