goon-foss/goon - Forgejo

Author	SHA1	Message	Date
jtrzupek	1654d78d59	fix(ingest): strip NUL bytes from raw payloads before Postgres write A source (TPDB) returned a performer alias containing a literal U+0000 ("Ramon.."). Postgres cannot store U+0000 in JSONB or text, so the external_records JSONB insert in _upsert_external_record failed with UntranslatableCharacter and the scene never ingested (GOON-Z). Recursively strip NUL from the raw payload (-> external_records.raw) and, when present, also re-validate the RawScene/RawMovie so normalize -> typed text columns get clean data too. Gated by a cheap _has_nul scan so clean records (the overwhelming majority) pay no extra cost. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 19:48:22 +02:00
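The NUL-stripping gate described in 1654d78d59 could look roughly like the sketch below. It is a minimal illustration, not the repository's code: `has_nul`, `strip_nul` and `clean_payload` are hypothetical names, and the payload is assumed to be plain JSON-like data (dicts, lists, strings, scalars).

```python
from typing import Any

NUL = "\x00"  # U+0000, which Postgres rejects in text and JSONB values


def has_nul(value: Any) -> bool:
    """Cheap recursive scan: True if any string key or value contains U+0000."""
    if isinstance(value, str):
        return NUL in value
    if isinstance(value, dict):
        return any(has_nul(k) or has_nul(v) for k, v in value.items())
    if isinstance(value, list):
        return any(has_nul(v) for v in value)
    return False


def strip_nul(value: Any) -> Any:
    """Return a copy of the payload with U+0000 removed from every string."""
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, dict):
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_nul(v) for v in value]
    return value


def clean_payload(raw: Any) -> Any:
    """Only rebuild the payload when the scan finds a NUL; clean records pass through."""
    return strip_nul(raw) if has_nul(raw) else raw


dirty = {"aliases": ["Ramon\x00.."], "id": 7}
print(clean_payload(dirty))  # the alias no longer contains U+0000
```

Returning the original object untouched when the scan finds nothing is what keeps the common case free of extra allocations.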
jtrzupek	aa05ce2647	feat(playback): direct-HLS manifest passthrough + proxy stream drop handling Time-bound HLS hosters whose manifest URL lacks a .m3u8 extension (e.g. pornhat's "...mp4,?..." path) were mis-detected by ExoPlayer as progressive MP4 and failed, forcing a full proxy fallback that streamed the whole video through the server. Serve such manifests via /proxy/hls/<token>/play.m3u8 with child URLs left absolute on the CDN, so the device fetches variant+segments directly and only the ~1KB manifest is proxied. Routed only for mobile_direct_ok (time-bound) HLS without a .m3u8 path. Also swallow httpx.TransportError in the stream proxy body generator: an upstream CDN closing the connection mid-stream is benign (client just retries a range) and should not surface as an unhandled error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 16:14:25 +02:00
jtrzupek	956a0feb22	docs: correct Bright Data proxy type (ISP, flat-rate not per-GB) It is an ISP proxy (static ISP IPs, flat billing), not residential — so HTML-ingest bandwidth is free and the full deep-crawl is fine. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:18:40 +02:00
jtrzupek	21bc8bf1fe	feat(superporn): browse scraper via Bright Data residential proxy superporn hard-blocks the VPS IP with Cloudflare 403 on every TLS impersonation, so HTML ingest routes through Bright Data residential (BRIGHTDATA_PROXY_URL, parsed in config). First scraper to use a proxy: optional _proxy on the browse base, threaded into browser_get. JSON-LD VideoObject (title/desc/uploadDate/thumb/duration) + pornstar and category chips; superporn double-encodes HTML entities so titles are unescaped twice. Thumbnails fetch fine from the VPS (no proxy). Playback stays off-proxy: the <source> mp4 token is IP-bound to the fetcher, so resolve is phone-side via WebView (extractor superporncom -> _vps_blocked_fallback), same as porndoe. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:47:45 +02:00
jtrzupek	80fd83cb4e	feat(tubes): add 4k69 + neporn browse scrapers, shared PlayTube base 4k69.com (~65k scenes): same PlayTube CMS as hqfap - common logic moved to _playtube.py (sitemap catalog, JSON-LD, pills). Studio classified by matching category pills against the studios index page. Streams are get_file (fullmovies family) returned unresolved with mobile_direct, 2160p skipped. neporn.com: KVS engine, latest-updates listing, JSON-LD + video:duration meta, performers from models links with flashvars video_tags fallback for fresh uploads. Resolve via _kvs; final URL portable cross-IP. superporn.com rejected: Cloudflare 403 from VPS on all TLS impersonations. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:15:13 +02:00
jtrzupek	6de986b9a7	feat(hqfap): browse scraper + native mp4 extractor (~120k scenes) PlayTube CMS. Sitemap-based pagination (listing has no GET paging), JSON-LD VideoObject metadata, pornstar/category pills, " Clips" categories mapped to studio. Direct mp4 (cdnde.com/okcdn.ru), tokens time-bound and portable cross-IP, so mobile plays direct. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:51:04 +02:00
jtrzupek	08079787da	feat(sxyprn): on-demand thumbnail resolver (live posters, ~1h-TTL workaround) trafficdeposit poster tokens live ~1h (hour-bucketed), so stored URLs can't persist. New GET /proxy/sxyprn-thumb/{post_id}: resolves the current og:image from the live /post/<id> page (cache resolved poster URL ~40min), streams bytes with Referer + long client Cache-Control (URL is stable per post_id → client disk-caches the image, backend fetches each post ~once). Deleted posts ("Post Not Found") → 404. Scene grid now emits /proxy/sxyprn-thumb/<id> for sxyprn sources (derived from page_url) instead of the dead stored trafficdeposit URL. Verified: live post → 200 image, deleted → 404, grid emits resolver URL. Backend-only, no OTA. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 15:02:49 +02:00
jtrzupek	f7670963df	fix(sxyprn): disable thumbnail refresh job — trafficdeposit token has ~1h TTL CORRECTION: trafficdeposit thumbnail tokens are hour-bucketed and valid only ~1h (verified 2026-06-10: stored ts=11:00 dead at 12:27, current ts=13:00 loads). Earlier "~weekly rot" read was wrong. Storing/periodically-refreshing sxyprn thumbnail URLs is futile — they expire within the hour. Default the refresh job OFF (kept in code). The dead-marking sweep (Post Not Found → dead_at) it performed was still valid. Live sxyprn thumbnails need on-demand resolution at serve time (future work). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 14:29:24 +02:00
jtrzupek	fef28ae56b	feat(sxyprn): refresh rotting thumbnails from live post pages + scheduled job CORRECTION to earlier "unrecoverable" call: the /post/<id> page is alive (200) and DOES expose the scene's own fresh-signed poster via og:image / <video poster> (post-id embedded, current timestamp) — only the STORED thumbnail URL had rotted. Search/listings don't re-surface old posts (0 overlap), but per-post fetch works. scripts/refresh_sxyprn_thumbs.py: iterate live sxyprn sources, fetch post page, extract fresh og:image, UPDATE thumbnail_url (verified: refreshed URLs return 200). _job_refresh_sxyprn_thumbs: every 12h refresh the 1200 least-recently-updated sources (cycles the ~19k catalog within the expiry window). Pairs with the scene_resolver overwrite fix so refreshed thumbnails stick. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 10:36:30 +02:00
jtrzupek	bb9e1afc31	fix(resolver): refresh thumbnails on re-scrape instead of fill-only-if-null _upsert_playback_sources only set thumbnail_url when the existing value was NULL, so signed CDN thumbnails that ROT (sxyprn/trafficdeposit tokens expire ~weekly → 404) were never replaced even when a fresh re-scrape captured a valid URL — making the rot permanent (bug 2026-06-10). Always overwrite thumbnail_url/animated_thumbnail_url with the freshly-scraped value when present; other fields keep fill-if-null. Lets the regular performer-driven ingest self-heal thumbnails for re-crawled scenes. (Note: old sxyprn backlog can't be bulk-refreshed — search/listings don't re-surface those posts, verified 0 overlap — so it's forward-looking; old sxyprn-only scenes fall back to the clean placeholder.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 10:28:18 +02:00
jtrzupek	adbdce1c75	fix(api): de-prioritize rotting sxyprn/trafficdeposit thumbnails sxyprn thumbnails are time-signed on trafficdeposit CDN and ROT — the signed asset 404s after ~weeks and can't be re-signed/refreshed server-side (bug 2026-06-10, ~15k sxyprn-only scenes showed broken thumbs). In the light-list slim-thumbnail pick, prefer a thumbnail from any non-trafficdeposit source; fall back to sxyprn only when it's the scene's sole thumbnail (recent ones still load; dead ones now render a clean placeholder client-side instead of a broken image). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 10:11:10 +02:00
jtrzupek	c8baa11604	feat(api): device-scope user state (favorites/progress/blacklists) Public instance has no accounts, so all user state was GLOBAL in DB — new users saw/overwrote each other's (and Jan's) favorites, watched badges and blacklists (bug 2026-06-10). Add device_id (VARCHAR 64) to 9 state tables with composite PK (device_id, entity_id); app sends X-Device-Id header (get_device_id dep). All favorites/scene-favorites/blacklist/watch + scene&movie list/detail (is_favorite, watched, blacklist-hide) now filter by device. Existing rows backfilled to 'legacy-shared'; POST /me/adopt-legacy reassigns them to the caller once. Old clients (no header) map to legacy-shared so they keep working until OTA updates. Migration 0022: add col, backfill, composite PK. Verified on prod: 967 progress rows preserved, device isolation holds (new device sees none of legacy state). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 08:58:01 +02:00
jtrzupek	e1c7efb947	chore(api): drop unused has_animated_thumbnail scene filter The hold-to-preview gesture is being removed (did nothing useful), and no client sends this filter. Remove the Query param, its EXISTS filter, and the pure-default count guard reference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 09:52:15 +02:00
jtrzupek	e98ef6577e	feat(api): scene hide + merge-duplicate endpoints for long-press actions POST /scenes/{id}/hide — marks all playback_sources dead so the scene drops out of has_playback lists (reversible via dead_at; row kept for dedup/refs). POST /scenes/{keep_id}/merge/{drop_id} — merges drop into keep via scene_merge (moves refs/performers/tags/fingerprints/playback). Backs the new tile long-press menu (hide / mark-duplicate) replacing the dead animated-preview gesture. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 09:47:16 +02:00
jtrzupek	abddd27856	fix(proxy): stable image-proxy URLs so expo-image actually caches thumbnails make_token embedded the current timestamp in the expiry, so every /scenes fetch produced a DIFFERENT proxied URL for the same thumbnail → expo-image (keyed by URI) cache-missed and re-downloaded every list load / app launch. Add stable_bucket_sec: quantize the expiry base to a window so the URL is identical across requests. _wrap_image_proxy uses a 7-day bucket → thumbnails disk-cache for a week instead of re-fetching constantly. Answers "are thumbnails cached?": now yes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 09:45:22 +02:00
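The expiry quantization from abddd27856 can be sketched as follows. This is an assumption-laden illustration: the real `make_token` signature, its signing scheme and its token layout are not shown in the log, so the HMAC format, the `secret` parameter and the choice to round the base up to the end of the bucket are all hypothetical.

```python
import hashlib
import hmac

WEEK_SEC = 7 * 24 * 3600


def make_token(url: str, secret: bytes, now: float,
               ttl_sec: int = 3600, stable_bucket_sec: int = 0) -> str:
    """Sign url with an expiry; with a bucket, the expiry base is quantized."""
    base = int(now)
    if stable_bucket_sec > 0:
        # Snap the base to the end of the current bucket so every request
        # inside the same window yields the same expiry, hence the same token.
        base = base - (base % stable_bucket_sec) + stable_bucket_sec
    expiry = base + ttl_sec
    sig = hmac.new(secret, f"{expiry}:{url}".encode(), hashlib.sha256).hexdigest()
    return f"{expiry}.{sig[:32]}"


secret = b"example-secret"  # hypothetical key
thumb = "https://cdn.example/thumb.jpg"  # hypothetical URL
a = make_token(thumb, secret, now=1_000_000, stable_bucket_sec=WEEK_SEC)
b = make_token(thumb, secret, now=1_000_100, stable_bucket_sec=WEEK_SEC)
print(a == b)  # True: both timestamps fall in the same 7-day bucket
```

Because a URI-keyed image cache treats any change in the URL as a new asset, making the token a pure function of (URL, bucket) is what lets the client cache hit.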
jtrzupek	3e8a221981	feat(extractors): native HLS for xhamster; hqporner flyflv player xhamster: move from WebView fallback to server-side native HLS. The scene page is fetchable server-side and the xhcdn master m3u8 (variants + segments) is time-bound, not IP-bound (verified cross-IP), so mobile plays the HLS direct with zero proxy bandwidth. New tubes/xhamster.py pulls the master m3u8 from SSR HTML and returns type='m3u8' mobile_direct; registry remaps xhamstercom off _vps_blocked_fallback. hqporner: add flyflv to the player-iframe host whitelist. hqporner rotated some players to flyflv.com; the CDN host was already whitelisted but the iframe host was not, so those scenes returned no stream. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 09:35:58 +02:00
jtrzupek	ffb80c7b60	feat(performer): replace dev Re-scrape button with top-tag chips bug-report 1a4bf258: "Re-scrape could go away, but tags/categories could be there instead". Re-scrape was a dev-only bulk thumbnail/tag enrich, which was noise on the performer page (per-scene enrich already happens on SceneDetail). Removed it; kept Search. New GET /performers/{id}/tags aggregates scene_tags across the performer's live-playback scenes (top N). PerformerScenes renders them as chips → tap navigates to TagScenes. Search button widened to full row. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:56:26 +02:00
jtrzupek	f8b1e801ef	fix(api): collapse same-origin playback sources on scene detail A merged scene often aggregates several uploads from ONE tube (re-encodes / 4K dups). bug-report aa79a995 "why 2 links, both porntrex?" = same scene std + 4K (porntrex 2591377 + 2593449 "...in 4K"). In the UI these are indistinguishable links to one hoster (same extractor). Keep one best per origin: prefer duration matching the scene → any duration → first (origin-asc stable). Dead already filtered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:50:45 +02:00
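The per-origin collapse in f8b1e801ef (duration matching the scene, then any duration, then first) might be expressed like this. The `Source` type, the `tolerance_sec` parameter and the function name are assumptions made for the sketch; the commit does not say how "matching" is measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    origin: str
    page_url: str
    duration_sec: Optional[int]


def collapse_by_origin(sources: list[Source], scene_duration: Optional[int],
                       tolerance_sec: int = 3) -> list[Source]:
    """Keep one best source per origin, returned in ascending origin order."""

    def rank(s: Source) -> int:
        if s.duration_sec is None:
            return 2  # no duration known: last resort
        if scene_duration is not None and abs(s.duration_sec - scene_duration) <= tolerance_sec:
            return 0  # duration matches the scene: best
        return 1  # some duration, but not matching

    best: dict[str, Source] = {}
    for s in sources:
        current = best.get(s.origin)
        # Strict '<' keeps the first-seen source on ties (stable).
        if current is None or rank(s) < rank(current):
            best[s.origin] = s
    return [best[origin] for origin in sorted(best)]


sources = [
    Source("porntrex", "https://example.invalid/a", None),
    Source("porntrex", "https://example.invalid/b", 1800),
    Source("eporner", "https://example.invalid/c", 1795),
]
print([s.page_url for s in collapse_by_origin(sources, scene_duration=1800)])
```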
jtrzupek	65b9df073a	fix(extractors): route sxylandcom through _embed_iframe, not webview fallback Chrome-DevTools investigation of bug-report 827a50a1 (sxyland "long loading, then webview, no autoplay") showed sxyland embeds playmogo.com/e/<id> — a DoodStream clone (doodcdn.io infra, pass_md5 protocol, get_slides) behind an INVISIBLE Cloudflare Turnstile (not an interactive CAPTCHA; auto-passes in a real browser/WebView from a residential IP). The sxyland page itself is NOT Turnstile-gated — VPS curl pulls the playmogo iframe URL straight from the HTML. sxylandcom was wired to _vps_blocked_fallback → phone loaded the entire sxyland page in WebView (ads, click-to-play, no autoplay = the reported symptom), and the playmogo embed never reached the phone's dood resolver. _embed_iframe (which already lists sxyland in its docstring) extracts the playmogo embed and emits it as type='hoster' → PlayerScreen routes playmogo URLs to doodstream.ts (resolveDoodStream), which resolves phone-side (phone IP passes invisible Turnstile) → direct mp4 → autoplay. Mobile unchanged (hoster→dood path already exists for xmoviesforyou/siska). Backend-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:41:38 +02:00
jtrzupek	e23e2d1f17	fix(merge): move playback_sources on scene merge + exact-title+duration dedup merge_scenes never reassigned playback_sources → ON DELETE CASCADE dropped them with the absorbed scene. Cross-source (canonical) merges rarely had tube playback so it hid, but tube-dup merges silently LOST playback links. Add _move_playback_sources (global unique (origin,page_url) guarantees no collision on reassign). + merge_exact_title_duration.py: catches missing-merge dupes bulk_dedup misses (same performer + identical normalized title + identical duration_sec, no phash). Bad Bella had 25 such pairs (bug-report ef92809d "duplicate, same thumbnails"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 10:56:50 +02:00
jtrzupek	7bf1fd6716	fix(xvideos): parse model name from nested span.name — recover 0-performer scenes xvideos renders the scene's models as `<a href="/models/slug">...<span class="name"> Display Name</span>...`. The old _MODEL_RE wanted text immediately after the anchor `>` and never matched current markup → browse-scraped scenes landed with 0 performers (bug-report 2026-06-07: "no actors, but Rebecca Johnson is on the page"). New regex captures slug + nested span.name, bounded within the anchor. + backfill script for the ~11.9k existing zero-performer xvideos scenes (54% have a real /models/ link; resolver merges names to canonical by name_normalized). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 10:13:21 +02:00
jtrzupek	2b602beea5	fix(dedup): tighten cross-source candidate prefilter to kill the 1800s hang (GOON-V) _candidate used OR logic (studio OR date±7d OR dur±30s) → 938,950 pairs; Stage-2 scoring at ~110/s never finished in 1800s → bulk_dedup_performers HUNG every run, orphan thread leaked until restart. Require AND: same studio plus (date±2d OR dur±30s). 939k→16k pairs, full run 213s. Real cross-source dup of one master shares studio + near date/duration; rare studio_id-mismatch pairs skipped on purpose: a job that COMPLETES beats one that times out merging nothing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 10:03:33 +02:00
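The tightened predicate from 2b602beea5 (same studio AND (date within 2 days OR duration within 30 s)) is sketched below. The `SceneKey` type and `is_candidate` name are hypothetical, and treating a missing studio as "never a candidate" is an assumption; the real `_candidate` filter may well live in SQL.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SceneKey:
    studio_id: Optional[int]
    release_date: Optional[date]
    duration_sec: Optional[int]


def is_candidate(a: SceneKey, b: SceneKey,
                 date_window_days: int = 2, dur_window_sec: int = 30) -> bool:
    """Same studio is mandatory; then either the date or the duration must be near."""
    if a.studio_id is None or a.studio_id != b.studio_id:
        return False
    near_date = (
        a.release_date is not None and b.release_date is not None
        and abs((a.release_date - b.release_date).days) <= date_window_days
    )
    near_duration = (
        a.duration_sec is not None and b.duration_sec is not None
        and abs(a.duration_sec - b.duration_sec) <= dur_window_sec
    )
    return near_date or near_duration


x = SceneKey(10, date(2026, 6, 1), 1800)
y = SceneKey(10, date(2026, 6, 2), 2400)
z = SceneKey(11, date(2026, 6, 1), 1800)
print(is_candidate(x, y), is_candidate(x, z))  # True False
```

Under the old OR logic the pair (x, z) would also have qualified, on date and duration alone, which is how the candidate set grew to hundreds of thousands of pairs.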
jtrzupek	cd257740be	fix(hqporner): require ALL query tokens in slug — stop performer over-attribution hqporner search post-filter kept a scene if its slug contained ANY query token (>=3 chars). For multi-word performer names this matched on a single common token (e.g. "anna","mia"), so the performer-driven ingest attributed the scene to EVERY performer sharing that token — scenes accumulated up to 503 wrong performers (hqporner = 5659 of 5897 scenes with >30 performers; bug-reports 2026-06-07). Switch ANY->ALL: the slug must contain every query token, requiring a full name match before attribution. Single-word names still work. Precision over recall — 144 wrong performers is far worse than missing a few loose matches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 09:28:18 +02:00
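The ANY-to-ALL switch in cd257740be amounts to changing one quantifier in the post-filter. The sketch below assumes tokens are lowercase alphanumeric runs of at least 3 characters and that "contains" means substring containment in the slug; `query_tokens` and `slug_matches` are illustrative names, and the example slugs and names are made up.

```python
import re


def query_tokens(query: str, min_len: int = 3) -> list[str]:
    """Split a search query into lowercase tokens of at least min_len characters."""
    return [t for t in re.split(r"[^a-z0-9]+", query.lower()) if len(t) >= min_len]


def slug_matches(slug: str, query: str) -> bool:
    """Keep a scene only if its slug contains EVERY query token (previously: any)."""
    tokens = query_tokens(query)
    slug_l = slug.lower()
    return bool(tokens) and all(t in slug_l for t in tokens)


print(slug_matches("anna-example-first-scene", "Anna Example"))  # True
print(slug_matches("anna-other-first-scene", "Anna Example"))    # False
print(slug_matches("mia-solo-scene", "Mia"))                     # True
```

With `any(...)` in place of `all(...)`, the second call would return True on the shared token "anna", which is the over-attribution the commit describes.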
jtrzupek	43f7e1f7b2	perf(scenes): literal tag_id in filter — 4-12s tag lists -> ~20ms Tag-filtered scene lists (e.g. blowjob + has_playback) took 4-12s. Root cause: the filter joined scene_tags->tags on slug, so the actual tag_id was opaque to the planner at plan time. It fell back to average per-tag cardinality (8.4M/11541 ≈ 726) instead of the real 273k, chose to materialize ALL matching scene_tags + check playback per row, then top-N sort. Fix: resolve slug->tag_id in the app and filter on a LITERAL tag_id (no slug join). With a constant, the planner uses MCV stats, knows the tag is huge, and walks ix_scenes_created_at_desc probing scene_tags/playback per scene, stopping at the page limit. Verified: blowjob list 3300ms -> 18ms (EXPLAIN), HTTP 4-12s -> 47ms. Unknown slug short-circuits to empty. (Pairs with the raised tag_id statistics target so mid-tier tags also get correct estimates.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 21:10:31 +02:00
jtrzupek	d52641774d	perf(scenes): light list payload — drop tags/refs, slim playback to thumbnail Scene list returned the full SceneOut per item (nested tags/external_refs + all playback_sources with page_url/embed/stream/quality) though SceneTile only reads the thumbnail + title/duration/performer/studio, and SceneDetail re-fetches the full scene via /scenes/{id}. Added light=True to _build_scenes_out_batch: skip the tags + external_refs queries entirely and collapse playback_sources to one slim entry (thumbnail_url + animated_thumbnail_url only). Result: default list payload 78KB->48KB (-38%), ~28ms cached, less DB work per list. Verified on emulator: grid thumbnails/durations/titles render unchanged. No mobile change (tile reads the same fields); server-side, no OTA. NOTE: the separate slow path — common-tag-filtered lists (4-12s; query expands all matching scene_tags before sort/limit) — is structural (needs a denormalized (tag_id, created_at) index) and deferred. VACUUM ANALYZE + raised tag_id stats applied but the planner still can't avoid the materialization. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 21:03:26 +02:00
jtrzupek	4922646011	feat(dedup): merge exact-phash + same-duration + shared-performer duplicates bug-report 2026-06-03 ("same duration, same thumbnail, why aren't they merged"): duplicate scenes not merged at ingest. Exact phash alone is noisy here (95% are collisions on shared thumbnails/intro frames, i.e. different scenes; bulk_dedup scorer correctly gives 0 auto-merge). The safe subset is exact-phash AND same duration (±3s) AND shared performer/title, a near-certain same scene. Same-duration is key: it excludes the false-merge pattern (short-clip-vs-full has DIFFERING durations). - scripts/merge_phash_exact_dupes.py: one-off, dry-run by default, per-pair re-fetch (handles clusters). Applied: 30 merged. - bulk_dedup: add `_pairs_exact_phash` (SQL O(N log N), not the O(N²) Hamming scan) + strategy "phash_exact" — gated by the normal scorer (surfaces review candidates, no risky auto-merge), schedulable for ongoing exact-collision review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 20:08:06 +02:00
jtrzupek	a196fcbcdb	refactor(ingest): rename scraper Source name "pornapp" -> "tube-scraper" The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest, performer-driven) was "pornapp" — a misleading leftover from the removed external porn-app API. It read like a dependency on a third-party "pornapp" service; it is not — these are our own scrapers hitting 25+ tubes directly (kind=scraper, origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs + external_records history stays linked. No behavior change — external_id keying (sitetag:url) and dedup are unaffected. NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy format (resolve_playback parses it) and is intentionally left untouched. Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 16:54:55 +02:00
jtrzupek	8c0edbdf7b	fix(playback): mark deleted sxyprn posts dead + rank native sources first Two bug-report fixes (2026-06-07): - sxyprn returns HTTP 200 "Post Not Found" for deleted posts (soft-404), so the extractor returned None → resolve treated it as transient and never marked the source dead, leaving a dead link offered forever. Now raise HosterDead on the marker so resolve marks it dead. - Scene playback sources were ordered alphabetically by origin, so a WebView- fallback hoster (fpoxxx, IP-bound + ad-heavy) ranked above a working native source (freshporno) on the same scene. Add is_vps_blocked_fallback() and sort native-resolve origins ahead of WebView-fallback ones. Verified on prod: sxyprn dead URL → HosterDead; scene sources reorder freshpornoorg before fpoxxx. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 14:09:01 +02:00
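The native-first ordering from 8c0edbdf7b can be pictured as a two-part sort key. The log names `is_vps_blocked_fallback()`, but its signature and the contents of the fallback set are not shown, so the set below and the `order_origins` helper are placeholders for illustration only.

```python
# Hypothetical set of origins that resolve via the WebView fallback.
VPS_BLOCKED_FALLBACK = frozenset({"fpoxxx", "pornxp"})


def is_vps_blocked_fallback(origin: str) -> bool:
    """True if this origin can only play through the WebView fallback."""
    return origin in VPS_BLOCKED_FALLBACK


def order_origins(origins: list[str]) -> list[str]:
    """Native-resolve origins first, WebView-fallback ones last; alphabetical within each group."""
    return sorted(origins, key=lambda o: (is_vps_blocked_fallback(o), o))


print(order_origins(["fpoxxx", "freshpornoorg"]))  # ['freshpornoorg', 'fpoxxx']
```

Sorting on `(is_fallback, origin)` keeps the previous alphabetical order as the tie-breaker, so only the relative position of fallback hosters changes.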
jtrzupek	9d0cb7f26e	fix(scheduler): bulk_dedup performers cross_source_only + hard-timeout (OOM) _job_bulk_dedup_performers called run_bulk_dedup(strategy="performers") without the cross_source_only guard whose docstring exists precisely to prevent this OOM. At current catalog scale the unguarded path materializes N²/2 pairs per prolific performer into a list → worker hit 6GB RSS and was OOM-killed every 12h (05:00/ 17:00), taking down concurrent tpdb/stashdb/movie ingests as killed_by_restart (0 new movies). Verified in prod: 05:00 run now completes (885k pairs scored, no OOM) and ingests succeed (stashdb +241, tpdb +175). Also wrap in _run_with_timeout like tpdb/stashdb (job had no hard-timeout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 11:00:19 +02:00
jtrzupek	fad72e9cd6	fix(tags): merge <base>2 numbered-duplicate tags + prevent regeneration TPDB taxonomy emits numbered-duplicate tags (name "Bubble Butt2"); slugify yields "bubble-butt2" (no separator before digit), so resolve_tag created a separate tag alongside "bubble-butt". Tube scenes inherited the dup via scene-merge → 75 pairs, ~10k scene_tags on the wrong tag. - resolve_tag: canonicalize "<base>2" -> "<base>" when base exists (handles current + future; trailing-"2"+alpha guard leaves milf-30/teen18 intact) - scripts/merge_dup2_tags.py: one-off bulk merge (scene_tags + movie_tags + blacklist) and taxonomy-count refresh Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 23:18:44 +02:00
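The "<base>2" canonicalization in fad72e9cd6 could be approximated with a single anchored regex. This is a guess at the guard's shape: it assumes the rule is "a letter immediately followed by a final 2", which leaves slugs such as milf-30 and teen18 alone; `canonical_tag_slug` and the `existing` set are hypothetical stand-ins for the lookup that `resolve_tag` performs.

```python
import re

# A trailing "2" counts as a duplicate marker only when it directly follows a letter.
_DUP2_RE = re.compile(r"^(?P<base>.*[a-z])2$")


def canonical_tag_slug(slug: str, existing: set[str]) -> str:
    """Map '<base>2' to '<base>' when the base tag already exists; otherwise keep the slug."""
    m = _DUP2_RE.match(slug)
    if m and m.group("base") in existing:
        return m.group("base")
    return slug


known = {"bubble-butt", "teen1", "milf-3"}
for s in ("bubble-butt2", "teen18", "milf-30", "brand-new2"):
    print(s, "->", canonical_tag_slug(s, known))
```

Requiring the base to exist already means a genuinely new tag that happens to end in 2 is not silently rewritten.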
jtrzupek	210aec0536	feat(scrapers): extract tags + description from porndish scene pages porndish-only scenes had no tags and no description — the scraper only derived a title from the URL slug. The scene page (g1/bimber WP theme) carries both: a <p class="entry-tags"> list of /video2/<slug>/ links (the "#" tags the user sees, categories + co-performers) and a prose description <p> in .entry-content. Override _fetch_scene_metadata in PornDishScraper to pull both from one page fetch. Extend the base hook to accept an optional 4th return element (description) and thread it into RawScene.description — backward compatible with the existing 3-tuple (pornhat). Strips leading embed-button labels ("Video Player N", "Server N") from the prose. Verified on live scenes: clean tag lists + real descriptions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 21:32:10 +02:00
jtrzupek	83918e9a8d	perf(movies+scenes): direct-play #hash movie hosters; skip empty blacklist filters Movies: the seekplayer-engine family (easyvidplayer/player4me/seekplayer/ embedseek/upns, ~322k sources) returns a time-bound master.m3u8 on a CDN with a valid IP-SAN cert that plays cross-IP. Mark it mobile_direct in resolve, and make MovieDetailScreen prefer direct_url with a proxy fallback (mirrors the scene path) — previously every movie streamed through the VPS proxy. Paradisehill multipart parts now go direct too. Device-verified: ExoPlayer plays the raw CDN direct, zero proxy traffic, no flicker. Scenes: the three blacklist NOT EXISTS clauses were appended to every filtered list and evaluated per-row even when all blacklist tables are empty (~3.4s tax on a deep mega-tag walk). Skip them when the tables are empty (cached check) — mega-tag list 6.7s -> 3.3s, and every filtered list benefits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 19:44:41 +02:00
jtrzupek	e780e1ae6f	fix(hdporngg+fullmovies): native get_file, skip broken 4K — "loading forever" User: "hdporngg loading forever". DevTools + cross-IP investigation (not guessing): - site is alive (sample scenes 200; the one earlier 404 was a single removed video, not the site — my earlier "site dead" was a hasty generalization). - both are the same platform (<source src=.../get_file/8512/...mp4>), no function/0. - the get_file 302 is fast (~100ms) but the 2160p/4K source on fpvcdn.com TIMES OUT (~30s); 720p/480p resolve in ~1s. The player loading 4K first = the "loading forever". - the final fpvcdn URL embeds the requester IP (ip=<fetcher>) -> IP-bound to whoever resolves it; BUT the get_file itself is stateless (fresh session works) and valid >=90s, and binds fpvcdn to the fetcher. So a VPS resolve would bind to the VPS IP (mobile 403), but returning the get_file URL UNRESOLVED lets the phone follow the 302 itself -> fpvcdn binds to the phone IP -> plays. Fix: new _source_getfile resolver returns get_file URLs as mobile_direct (skip 4K), phone resolves the 302 in-session. Native, multi-quality, no WebView, no proxy. Replaces fullmovies' old force_proxy+4K extractor and the WebView fallback for both. Backend-verified: resolve -> 720/480 mobile_direct, get_file fresh fetch -> 206. Pending on-device confirmation (emulator unstable; same mechanism as porn00/freshporno which work). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 22:48:55 +02:00
jtrzupek	c05bafb4c7	fix(porn00): backend KVS resolve (portable CDN, no proxy) — corrects #20 Same proper re-investigation as freshporno (DevTools + Bright Data residential cross-IP + curl_cffi browser TLS). porn00's final CDN fe.porn00.org/...?token=&expires= is PORTABLE cross-IP (token resolved from one residential IP replays 206 from a different Bright Data residential IP) and only rejects non-browser TLS (plain curl 403, curl_cffi chrome 206). In #20 I tested the final URL with a standalone plain curl, got 403, wrongly concluded "IP-bound" and left it on WebView (and before that it used force_proxy, which violated the no-proxy stance). porn00 flashvars are plain get_file (already decoded, no function/0 prefix), so extend _kvs._URL_RE to match both forms — real_url passes plain URLs through unchanged, _resolve_get_file follows the 302 in-session. porn00.py becomes a thin _kvs wrapper. Verified no regression for the function/0 tubes (yespornvip/pornditt/ freshporno still resolve 3x mp4). Result: porn00 native multi-quality, mobile_direct, zero proxy/WebView. fpoxxx and pornxp were re-tested the same way and ARE genuinely IP-bound (403 from a different residential IP — their token binds to the resolver IP), so they correctly stay on the WebView fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 21:15:19 +02:00
jtrzupek	6e3ad870a7	fix(freshporno): backend KVS resolve (portable CDN) — corrects #20 Re-investigated with the proper method (Chrome DevTools network capture + cross-IP test via Bright Data residential proxy + curl_cffi browser-TLS) instead of guessing. freshporno's real flow is get_file -> 302 -> cdn4.freshporno.org/remote_control.php -> 206 video/mp4. The CDN URL is PORTABLE cross-IP (a token generated from one residential IP replays fine from the VPS and from a different Bright Data residential IP), it only rejects non-browser TLS fingerprints (plain curl -> 000, curl_cffi chrome / ExoPlayer -> 206). In #20 I tested the final URL with a standalone plain curl, got 000, and wrongly concluded "unreachable from residential" -> kept it on the WebView fallback, which barely worked (ad-heavy page, flaky). That false negative is the regression the user reported. freshporno is function/0 KVS, so _kvs.resolve_kvs (which uses curl_cffi chrome) already decodes + resolves it to a portable mp4 — switch to backend resolve like yespornvip/pornditt: native, multi-quality, no proxy, no WebView. Verified: backend resolve returns 3x mp4 (1080/720/480, mobile_direct) + cdn 206; user confirmed native playback on device. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 21:12:17 +02:00
jtrzupek	c18ed24330	extractors: register fullmoviesxxx + hdporngg (WebView fallback) Bug 19866e9e ("problem with both hosters"): a scene whose only two sources were fullmovies.xxx and hdporn.gg wouldn't play at all because neither had an entry in the extractor registry, so try_extract returned None ("no stream"). fullmovies.xxx serves a <source ...get_file...mp4> but the get_file CDN times out from the VPS (unreachable, like freshporno), so backend resolve isn't viable; hdporn.gg sample pages 404. Route both through the WebView fallback so the phone (residential IP) loads the page and plays / the injected-JS scrape can grab the URL, which is strictly better than no playback path. Surfaced by the hoster sweep + this bug report. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 22:16:05 +02:00
jtrzupek	e42217773f	feat(deep-crawl): xvideos browse source (capped) + per-tube page cap xvideos SSR's JSON-LD VideoObject (duration/title/uploadDate) + on-page /models/ (perf) + /tags/. Sample: median ~10.5min, 93% >=3min. Pilot (2 pages): 29 new, 100% playable + visible + tagged (performers sparse — xvideos 'new' is amateur-heavy; /models/ tagged mostly on studio rips). - XVideosBrowseScraper (JSON-LD + page-parse models/tags), in ALL_BROWSE_SCRAPERS. - deep_crawl._PAGE_CAP: per-sitetag depth cap; xvideoscom=1800 (~newest 50k). At the cap the tube is marked exhausted (reset -> incremental re-sweep) so a mega-tube cannot monopolize the round-robin or balloon the DB. - ported yesporn.py into the public repo (was prod-only, like hdporngg) ending the __init__ public/prod divergence. youporn rejected: JSON-LD lacks actor/keywords, its /pornstar//category/ links are A-Z nav not scene-specific. xhamster: 429/Cloudflare from the VPS IP. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 11:16:44 +02:00
jtrzupek	ee4915770f	feat(deep-crawl): eporner via JSON API as SSR-rich source (Phase 2b alternative) porntrex/hqporner rejected for deep-crawl: KVS sites with no SSR metadata (77% of existing porntrex has no duration -> invisible under the app's >=60 filter). eporner instead exposes a public JSON API (api/v2/video/search) returning title + length_sec + keywords + added per video; ~100k videos, ~100/page, no per-scene detail fetch. - BaseBrowseScraper.crawl_page(page): factored out of latest_scenes; returns None (transient fail) / [] (catalog end) / [scenes]. API subclasses override it. - deep_crawl drives via crawl_page (supports HTML-listing AND API sources). - EpornerApiScraper: crawl_page hits the eporner API -> RawScene with duration+tags+ date+thumb+playback; registered in ALL_BROWSE_SCRAPERS. - Pilot (2 API pages): 192 new, 100% playable + tagged + visible (>=60); the <180s trailer filter dropped 6 short clips. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 10:37:20 +02:00
jtrzupek	0f19a61789	feat(ingest): skip <180s tube scenes (trailers) + purge porndoe trailer orphans Deep-crawling tube catalogs pulls in lots of <3min trailers/teasers (porndoe). Add min_ingest_duration_sec (default 180): _process_scene skips scraper-source scenes whose known duration is below the floor (unknown duration kept; canonical TPDB/StashDB untouched). Deleted 67 existing porndoe-only orphan trailers (<180s, no canonical, no non-porndoe live playback) via cascade. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 10:11:25 +02:00
jtrzupek	7e46e5ac48	feat(scheduler): deep-crawl full tube catalogs (Phase 2a — ingest-all) We ingested only ~3% of each browse tube's catalog (porndoe >62k scenes; we had 1959) because tubes were hit only by performer-search + top-N browse. Pilot (porndoe pages 64-110): 1119 new scenes, 100% playable + 100% tagged, 0% canonical overlap (purely additive — content not in TPDB/StashDB). - app/scheduler/deep_crawl.py: round-robin over ALL_BROWSE_SCRAPERS, per-tube page cursor in app/_state/deepcrawl_state.json (no DB migration), deep-paginate from the cursor, idempotent (resolver skips known by raw_hash), mark 'exhausted' at catalog end then reset cursors for an incremental re-sweep. - _job_deep_crawl: hourly, 60 pages/run (~1860 scenes, ~22 min), wrapped in the 1h hard-timeout; registered in build_scheduler (jobs=10). - config: sched_deep_crawl_hours=1, deep_crawl_pages_per_run=60, deepcrawl_state_path. - scripts/pilot_porndoe_deepcrawl.py: one-off pilot used to validate the approach. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:26:44 +02:00
jtrzupek	58b355b6b5	fix(pornhub): WebView fallback — yt-dlp gets 403 from VPS Hoster sweep (2026-06-02) found pornhub resolving to 0 sources: yt-dlp (current, 2026.03.17) gets HTTP 403 fetching the watch page from the Hetzner VPS, while the other yt-dlp tubes (xvideos/xnxx/youporn/redtube) still work — so it's a Pornhub-specific block of the server IP, not a yt-dlp regression. Route pornhub through the WebView fallback so it plays from the phone's residential IP, same as xhamster. 7.3k scenes affected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:41:38 +02:00
jtrzupek	d4c4b79e92	fix(kvs): cap get_file timeout + early-break on dead scenes Bug 6ec1960e: yespornvip "resolving forever". yesporn.vip moved to a cdn4/remote_control.php CDN (still portable cross-IP — verified 206 from a residential IP, so backend resolve stays correct). But when a video is removed from the CDN the page still exists and each get_file 302-follow STALLS to the full timeout. With the resolve timeout (60s) applied per quality variant, a dead scene hung 3x60 = 180s and returned nothing -> the mobile resolve spinner never ended. Fix: a dedicated low get_file timeout (10s, separate from the page-fetch timeout) and an early-break once 2 variants fail with no result so far (the scene is dead on the CDN — no point waiting for the third). Dead scene now resolves to None in ~20s instead of 180s; a live scene is unaffected (~0.8s, 3 sources). Applies to all KVS tubes (yespornvip + pornditt). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 21:33:05 +02:00
jtrzupek	08f901712c	fix(scheduler): hard-timeout heavy jobs + periodic stuck-run reaper At the shared 05:00 anchor all heavy jobs fire together; tpdb/stashdb/performer-driven had no timeout, so a hung connector blocked the whole job and — with max_instances=1 — blocked every future fire of that job until a worker restart (incident 2026-06-02: 6 runs hung 8.7h, movie mirrors 47h stale, tube ingest stalled). - _run_with_timeout wraps tpdb/stashdb/performer-driven in a 30-min hard cap (same ThreadPoolExecutor pattern movie-ingest already uses): on timeout the job returns and frees the scheduler slot; the orphaned thread lives until restart. - _job_reap_stuck: hourly reaper of 'running' >2h rows, registered in the scheduler — the startup-only reaper missed hangs while the worker stayed up for hours. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 16:17:50 +02:00
jtrzupek	983bf62416	perf(scenes): drop exact count on filtered lists; index scene_tags(tag_id) The filtered scene-list endpoints (default feed sends min_duration_sec=60, plus has_playback / tag / q filters) took ~4.5s on an idle server. Profiling showed the entire cost was the bounded COUNT subquery over the EXISTS filters: Postgres would not reliably early-terminate at the cap under psycopg bound params, scanning the whole matching set (~858k for has_playback). Counting over the PK and using a literal LIMIT helped some cases but the plan stayed unstable. Fix: stop computing an exact count for filtered lists entirely. The mobile client paginates by has_more (per_page+1 fetch), never by total; total is only the "N+" UI counter. Derive total as a lower bound from the page + has_more after the fetch. This removes the count query from every filtered request. Result (end-to-end, authenticated): default feed 4.5s -> ~0.1s, has_playback 4.4s -> ~0.1s, q/studio/normal-tag filters all <0.3s. Also added index scene_tags(tag_id, scene_id) (PK led with scene_id, so tag->scenes did a seq scan). Remaining: a single enormous tag (e.g. "anal", ~163k scenes) ordered by recency still gathers-all-then-sorts in the fetch (~5s); normal tags are <0.5s. Tracked in #22 for a denormalized recency-ordered approach. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 12:00:36 +02:00
jtrzupek	20a8dc8e27	perf(scenes): count over PK, not whole entity, in filtered list The bounded count for filtered scene lists ran `SELECT count(*) FROM (SELECT scenes.* ... LIMIT 1001)` because the base query selects the full Scene entity. Counting over all columns made the planner pick a far worse plan via psycopg bound params (~4s for has_playback) than the same logic over the PK (~30-400ms). Count semantics are unchanged (we only need rows to exist), so count over `base.with_only_columns(Scene.id)`. Partial: this fixes the count leg. The main ordered fetch on filtered lists (has_playback / tags) can still pick a gather-all-then-sort plan under bound params (fast with literal binds, slow parameterized); tracked separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 11:14:38 +02:00
jtrzupek	817b50fbf8	fix(scenes): propagate playback duration to Scene + duration-consistent counts Scene.duration_sec was NULL for ~74% of playable scenes (tube duration lives on playback_source, never propagated to Scene), so the mobile min_duration_sec=60 filter (Scene.duration_sec >= 60; NULL fails) silently hid them — surfaced as '119 in favorites, 14 after entering the performer' (Safira Yakkuza). - resolver: _effective_duration() falls back to max live playback_source duration when the connector provides no scene-level duration (forward fix, used in create + update). - scripts/backfill_scene_duration_from_playback.py: one-off idempotent backfill (recovered 204,014 scenes). - taxonomy_counts: scene_count now counts playable AND duration_sec >= 60, matching the always-60s-filtered scene lists, so favorites/performer/studio/tag badges agree with what the scene screen actually shows (Safira: 39 == 39). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 21:31:01 +02:00
jtrzupek	cd12348782	fix(movies): paradisehill delta date-granularity + browse cadence docs - paradisehill.fetch_movies compared release_date coerced to midnight against the `since` timestamp, so the chronological crawl stopped at the first upload dated the same calendar day as `since` and silently dropped most new movies (0-2 seen per run; Movies tab stalled). Compare by DATE with a 1-day grace instead; idempotent external_records upsert dedups the re-fetched recent window. - scripts/backfill_paradisehill_movies.py: one-off no-delta deep crawl to recover the backlog missed during the bug (idempotent, resumable). - docs: correct stale 'raz dziennie/24h' ("once a day/24h") browse-latest comments to 6h (4x/day), the actual configured cadence (config.py sched_browse_latest_hours=6). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 17:00:10 +02:00
jtrzupek	da7fcda132	feat(ingest): SQL phash match, tag inference + backfill, clip-store skip, browse tubes, watchdog Resolver/perf: - find_by_phash_within: nearest match via Postgres bit_count over bit(64) XOR instead of Python scan of all phash fingerprints (~20x faster per scene; unblocks long delta runs that were killed mid-run before since advanced). Scheduler/reliability: - reap ingest_runs stuck in 'running' on worker startup (killed_by_restart). - smoke_test: per-source ingest health, stuck-run and browse-freshness checks -> Sentry; exclude killed_by_restart from the failed-run alarm. Tags (ingest with tags + fill blanks): - wire infer_tag_slugs into normalize_scene so tube scenes get title-inferred tags (was dead code); union with connector tags. - scripts/backfill_inferred_tags.py: keyset/batched/idempotent backfill for existing tagless scenes (playable tag coverage 16% -> ~52%). Clip-store: - skip ManyVids/IWantClips/Clips4Sale/... from canonical sources at ingest (GOON_SKIP_CLIP_STORE, default on) — permanent orphans, ~56% of canonical ingest, never have a free-tube playback source. Browse tubes: - enable fullmovies + hdporn.gg: studio parsed from title prefix instead of the /networks/ sidebar (which always yielded the first listed network); drop phash compute (pilot: 0% canonical hit within Hamming 5 — auto-screenshots), matching relies on title/performer/duration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 15:07:35 +02:00
jtrzupek	86c9bd438b	extractors: keep freshporno/porn00/pornxp/fpoxxx on WebView (IP-bound CDN) Re-checked whether these four KVS tubes could move to server-side resolve like yespornvip/pornditt/porntrex. All four are reachable from the backend, but cross-IP testing showed their final CDN URLs are IP-bound to the resolving host (403 / connection refused from a different IP; fpo.xxx even embeds the resolver IP in its acctoken). Unlike the portable cdn5/twa CDNs, backend resolve cannot produce a mobile-playable URL here without a proxy, which is out of scope for the public app. - porn00: was using force_proxy resolve (violated the no-proxy stance); switched to the WebView fallback like its siblings. The ad exposure that originally motivated the proxy path is mitigated by the recent ad-filter work (AD_HOSTS + cover overlay + injected-JS ad-CDN skipping). - freshporno/pornxp/fpoxxx already on WebView fallback; comments updated with the cross-IP findings so this isn't re-investigated. - Dropped the now-unused tube extractor imports (F401). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 10:55:44 +02:00
jtrzupek	920740b76f	fix(pornditt): server-side KVS resolve; extract shared _kvs helper pornditt is the same kt_player KVS engine as yespornvip: flashvars carry function/0/-obfuscated get_file urls + license_code, and the VPS reaches it (HTTP 200). It was on _vps_blocked_fallback (WebView), where the scrape grabbed the VAST preroll ad (trafostatic) instead of content (user bug "pornditt grabs the ad instead of the video"). Extracted the verified yespornvip logic into app/extractors/tubes/_kvs.py (resolve_kvs: fetch page → decode function/0 get_file via kt_player algo → follow 302 in-session → portable CDN, multi-quality). yespornvip.py and new pornditt.py are now thin wrappers. Registry: porndittcom _vps_blocked_fallback → pornditt.extract. Verified on prod: pornditt → 720p/480p on twa.tgprn.com (portable, fresh-session 206 video/mp4); yespornvip still → 1080/720/480p on cdn5 (refactor intact). Backend-only, no OTA: mobile plays mp4+mobile_direct_ok natively with quality picker, zero WebView/ads. Note: a real-browser residential load shows MEDIA_ERR on the content (the page's own player flow / ad gating); server-side decode+follow sidesteps the player entirely, which is why it resolves cleanly. The original bug scene (40f118e1) has its video deleted on pornditt, so the fix was verified on a live scene (156091). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 10:36:33 +02:00
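
For readers following commit 983bf62416, the has_more pagination it describes (fetch per_page+1 rows, derive total as a lower bound) might look roughly like the sketch below. This is an illustration only, not the repository's code: the function name `paginate`, the 1-indexed `page` argument, and the exact lower-bound formula are all assumptions.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def paginate(rows: Sequence[T], page: int, per_page: int) -> tuple[list[T], bool, int]:
    """Hypothetical helper: `rows` holds up to per_page + 1 rows fetched for `page`.

    Returns (items, has_more, total_lower_bound). No COUNT query is needed.
    """
    # The extra (per_page + 1)-th row only signals that another page exists.
    has_more = len(rows) > per_page
    items = list(rows[:per_page])
    # Assumed formula: rows on earlier pages + rows on this page,
    # plus one when at least one further row is known to exist.
    total_lower_bound = (page - 1) * per_page + len(items) + (1 if has_more else 0)
    return items, has_more, total_lower_bound
```

With per_page=20, a first-page fetch that returns 21 rows yields 20 items, has_more set, and a lower bound of 21, which a client could render as an "N+" counter.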

71 commits