The one-off cleanup merged ~13.5k same-video-different-title dupes, but they regrow as
these sibling tubes re-ingest under new titles. Wire the asset-id+duration merge into
the scheduler (every 12h, GOON_SCHED_THUMB_DEDUP_HOURS, 0=off) so it stays clean.
Shared logic lives in app/scheduler/thumb_dedup.py (run_thumb_asset_dedup); the one-shot
script now imports it. Same tight signature as the cleanup: family hosts only + identical
duration (the bare asset-id number is reused across unrelated CDNs, so cross-host or
different-duration grouping is excluded). Reports 205b17d9 / 5a2944cb.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
These sibling platforms share one video-id space and ingest the same video under
different titles, which bulk_dedup misses (different titles, no phash). Match by the
asset-id in the thumbnail path (/<bucket>000/<id>/) on img.hdporn.gg|fullmovies.xxx plus
identical duration, and merge. Hard host restriction + duration guard: the bare number
is reused for unrelated videos on other CDNs (verified via dry-run), so cross-host or
different-duration grouping is excluded. Run scoped (studio id) or global; dry-run by
default. Reports 205b17d9 / 5a2944cb. Ran on Parasited: 43 pairs merged.
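
The matching rule above can be sketched roughly as follows. This is an illustration, not the project's code: the function names, the expansion of the host alternation into two concrete hosts, and the sample URLs in the usage note are all assumptions.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: how "img.hdporn.gg|fullmovies.xxx" expands into concrete hosts.
FAMILY_HOSTS = {"img.hdporn.gg", "fullmovies.xxx"}
# /<bucket>000/<id>/ : a bucket number ending in 000, then the asset id.
_ASSET_RE = re.compile(r"/(\d+000)/(\d+)/")


def asset_key(thumbnail_url, duration_sec):
    """Return (asset_id, duration_sec) for family-host thumbnails, else None."""
    if duration_sec is None:
        return None
    parsed = urlparse(thumbnail_url)
    if parsed.hostname not in FAMILY_HOSTS:
        return None  # the bare number is reused on unrelated CDNs
    match = _ASSET_RE.search(parsed.path)
    if match is None:
        return None
    return (int(match.group(2)), duration_sec)


def group_duplicates(scenes):
    """scenes: iterable of (scene_id, thumbnail_url, duration_sec) tuples."""
    groups = defaultdict(list)
    for scene_id, url, duration in scenes:
        key = asset_key(url, duration)
        if key is not None:
            groups[key].append(scene_id)
    # Only keys shared by two or more scenes are merge candidates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Keying on the pair (asset id, duration) means a scene on an unrelated host, or one with a different duration, can never land in the same group even when the bare number matches.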
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
resolve_post() now distinguishes "Post Not Found" (mark dead_at — the
link wouldn't play anyway) from a live page with no fresh poster (leave
untouched), on top of the existing thumbnail refresh. Batched into
refresh_batch() with refreshed/dead/untouched counters.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CORRECTION to earlier "unrecoverable" call: the /post/<id> page is alive (200) and
DOES expose the scene's own fresh-signed poster via og:image / <video poster>
(post-id embedded, current timestamp) — only the STORED thumbnail URL had rotted.
Search/listings don't re-surface old posts (0 overlap), but per-post fetch works.
scripts/refresh_sxyprn_thumbs.py: iterate live sxyprn sources, fetch post page,
extract fresh og:image, UPDATE thumbnail_url (verified: refreshed URLs return 200).
_job_refresh_sxyprn_thumbs: every 12h refresh the 1200 least-recently-updated sources
(cycles the ~19k catalog within the expiry window). Pairs with the scene_resolver
overwrite fix so refreshed thumbnails stick.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Final Polish-char print crashed with UnicodeEncodeError on Windows cp1252 stdout
AFTER a successful publish, making exit code 1 misleading. Reconfigure stdout/stderr
to UTF-8 up front.
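
A minimal sketch of the up-front reconfiguration, assuming Python 3.7+ where text streams expose `reconfigure()`. The helper name `force_utf8` and the in-memory demo stream are illustrative only.

```python
import io
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8", errors="replace")
    return stream


# At the top of the script, before anything is printed:
force_utf8(sys.stdout)
force_utf8(sys.stderr)

# Demonstration on an in-memory stream that starts out as cp1252.
buffer = io.BytesIO()
demo = io.TextIOWrapper(buffer, encoding="cp1252")
force_utf8(demo)
demo.write("zażółć gęślą jaźń")  # not encodable in cp1252
demo.flush()
```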
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
--playback-only restricts to scenes with live playback (app-visible dupes only).
Progress print every 500 merges for long global runs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
merge_scenes never reassigned playback_sources → ON DELETE CASCADE dropped them
with the absorbed scene. Cross-source (canonical) merges rarely had tube playback
so it hid, but tube-dup merges silently LOST playback links. Add _move_playback_sources
(global unique (origin,page_url) guarantees no collision on reassign).
+ merge_exact_title_duration.py: catches unmerged dupes that bulk_dedup misses
(same performer + identical normalized title + identical duration_sec, no phash).
Bad Bella had 25 such pairs (bug-report ef92809d "duplicate, same thumbnails").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
'--workers 3' set limit=3 because the bare '3' also hit the isdigit() branch.
Skip flag-value positions when scanning for a positional LIMIT.
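
One way to express the fix, as a sketch: the flag set and the function name `parse_limit` are hypothetical, and the real script's options may differ.

```python
# Hypothetical set of flags that consume the following argument.
FLAGS_WITH_VALUE = {"--workers"}


def parse_limit(argv, default=None):
    """Return the first bare integer that is not the value of a flag."""
    skip_next = False
    for arg in argv:
        if skip_next:
            # This position belongs to the preceding flag, not to LIMIT.
            skip_next = False
            continue
        if arg in FLAGS_WITH_VALUE:
            skip_next = True
            continue
        if arg.isdigit():
            return int(arg)
    return default
```

With this, `parse_limit(["--workers", "3"])` yields the default rather than 3, while `parse_limit(["--workers", "3", "500"])` yields 500.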
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xvideos renders the scene's models as `<a href="/models/slug">...<span class="name">
Display Name</span>...`. The old _MODEL_RE wanted text immediately after the anchor
`>` and never matched current markup → browse-scraped scenes landed with 0 performers
(bug-report 2026-06-07: "no actors, but Rebecca Johnson is on the page"). New regex
captures slug + nested span.name, bounded within the anchor. + backfill script for the
~11.9k existing zero-performer xvideos scenes (54% have a real /models/ link; resolver
merges names to canonical by name_normalized).
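
A sketch of a pattern with the properties described above (slug plus nested span.name, bounded within the anchor). It is not the actual `_MODEL_RE`; the pattern, the function name and the sample markup in the usage note are assumptions.

```python
import re

# Slug from the anchor's href, then the first span.name found before </a>.
MODEL_RE = re.compile(
    r'<a\b[^>]*\bhref="/models/(?P<slug>[^"/?#]+)"[^>]*>'
    r'(?:(?!</a>).)*?'  # anything up to the name span, never past </a>
    r'<span\b[^>]*\bclass="name"[^>]*>\s*(?P<name>[^<]+?)\s*</span>',
    re.IGNORECASE | re.DOTALL,
)


def extract_models(html):
    """Return (slug, display_name) pairs for each /models/ anchor."""
    return [(m.group("slug"), m.group("name")) for m in MODEL_RE.finditer(html)]
```

The tempered `(?:(?!</a>).)*?` is what keeps the match inside one anchor: a `span.name` that appears after the closing `</a>` is never attributed to the preceding link.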
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
audit_false_merges only auto-fixes n>=3 (majority disambiguates the outlier); n=2
was "needs human review" — but the merge-review UI is gone, nobody triages 500+.
Measured: of 535 n=2 duration-divergent scenes, ALL have a canonical scene.duration_sec
(TPDB/StashDB) and 531 have exactly one source matching canonical (±20%) + one >2x off
→ unambiguous false-merge. Kill the off source (works both directions since canonical is
corroborated by the matching keeper, unlike the Omar-case the n>=3 audit guards against).
Applied: 529 sources marked dead (4 ambiguous skipped). Reversible (dead_at).
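
The n=2 decision rule can be sketched like this. The function name and signature are illustrative; the thresholds (±20% match, >2x off) come from the message above.

```python
def pick_false_merge(canonical_sec, durations, tolerance=0.20, off_factor=2.0):
    """For a scene with exactly two sources, return the index of the source
    to mark dead, or None when the case is ambiguous."""
    if not canonical_sec or len(durations) != 2:
        return None
    if any(d is None or d <= 0 for d in durations):
        return None

    def matches(d):
        return abs(d - canonical_sec) <= tolerance * canonical_sec

    def far_off(d):
        # ">2x off" in either direction: too short or too long.
        return max(d, canonical_sec) / min(d, canonical_sec) > off_factor

    for keep, drop in ((0, 1), (1, 0)):
        if matches(durations[keep]) and far_off(durations[drop]):
            return drop
    return None
```

A source is only dropped when the other one corroborates the canonical duration, which is why the rule can act on outliers in both directions.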
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bug-report 2026-06-03 ("same duration, same thumbnail, why don't they merge"):
duplicate scenes not merged at ingest. Exact phash alone is noisy here (95% are
collisions on shared thumbnails/intro frames — different scenes; bulk_dedup scorer
correctly gives 0 auto-merge). The safe subset is exact-phash AND same duration
(±3s) AND shared performer/title — near-certain same scene. Same-duration is key:
it excludes the false-merge pattern (short-clip-vs-full has DIFFERING durations).
- scripts/merge_phash_exact_dupes.py: one-off, dry-run by default, per-pair re-fetch
(handles clusters). Applied: 30 merged.
- bulk_dedup: add `_pairs_exact_phash` (SQL O(N log N), not the O(N²) Hamming scan)
+ strategy "phash_exact" — gated by the normal scorer (surfaces review candidates,
no risky auto-merge), schedulable for ongoing exact-collision review.
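
An in-memory sketch of the "safe subset" filter, assuming scenes are plain dicts. The field names (`phash`, `duration_sec`, `performers`, `title_norm`) and the function name are illustrative, and bucketing by hash stands in for the SQL grouping.

```python
from collections import defaultdict
from itertools import combinations


def safe_exact_phash_pairs(scenes, max_duration_delta=3):
    """Return (id_a, id_b) pairs with identical phash, duration within
    max_duration_delta seconds, and a shared performer or title."""
    by_phash = defaultdict(list)
    for scene in scenes:
        if scene["phash"] is not None:
            by_phash[scene["phash"]].append(scene)

    pairs = []
    for bucket in by_phash.values():
        # Pairwise work is confined to scenes that share one exact hash.
        for a, b in combinations(bucket, 2):
            if a["duration_sec"] is None or b["duration_sec"] is None:
                continue
            if abs(a["duration_sec"] - b["duration_sec"]) > max_duration_delta:
                continue  # short clip vs. full scene: different durations
            shared = bool(a["performers"] & b["performers"]) or (
                a["title_norm"] == b["title_norm"]
            )
            if shared:
                pairs.append((a["id"], b["id"]))
    return pairs
```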
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bug-report 2026-06-01 (48d6cc6b): scene shows canonical duration from TPDB
(real 22min studio scene) but the only live playback_source is a short tube
teaser (xnxx 21s) → "shows 22m, plays <1m". When ALL live sources are a tiny
fraction (<15%) of a known canonical (>300s), the scene has no real playback;
mark those sources dead → scene becomes orphan → hidden (has_playback=false),
consistent with the orphan-hiding policy. Reversible (dead_at), conservative
(skips scenes with any unknown-duration or full-length live source).
Applied on prod: 182 sources dead across 174 scenes.
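
The teaser-only condition can be written as a small predicate. The function name and argument shapes are illustrative; the thresholds (<15% of a canonical >300s) are taken from the message above.

```python
def teaser_only(canonical_sec, live_durations, max_fraction=0.15, min_canonical=300):
    """True when every live source is a tiny fraction of a known canonical."""
    if canonical_sec is None or canonical_sec <= min_canonical:
        return False
    if not live_durations:
        return False
    if any(d is None for d in live_durations):
        return False  # unknown duration: stay conservative and skip
    return all(d < max_fraction * canonical_sec for d in live_durations)
```

For the reported case (canonical 22 min, one 21 s source) the predicate is true; a single full-length or unknown-duration live source makes it false.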
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp" — a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service; it is
not — these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs +
external_records history stays linked. No behavior change — external_id keying
(sitetag:url) and dedup are unaffected.
NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.
Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TPDB taxonomy emits numbered-duplicate tags (name "Bubble Butt2"); slugify
yields "bubble-butt2" (no separator before digit), so resolve_tag created a
separate tag alongside "bubble-butt". Tube scenes inherited the dup via
scene-merge → 75 pairs, ~10k scene_tags on the wrong tag.
- resolve_tag: canonicalize "<base>2" -> "<base>" when base exists (handles
current + future; trailing-"2"+alpha guard leaves milf-30/teen18 intact)
- scripts/merge_dup2_tags.py: one-off bulk merge (scene_tags + movie_tags +
blacklist) and taxonomy-count refresh
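
A sketch of the canonicalization step, under one reading of the guard: a trailing "2" is stripped only when it is glued directly onto a letter and the base slug already exists. The regex and function name are assumptions, not the actual resolve_tag code.

```python
import re

# A trailing "2" directly after a letter, e.g. "bubble-butt2".
_DUP2_RE = re.compile(r"^(?P<base>.*[a-z])2$")


def canonical_tag_slug(slug, existing_slugs):
    """Map "<base>2" to "<base>" when the base tag already exists."""
    match = _DUP2_RE.match(slug)
    if match and match.group("base") in existing_slugs:
        return match.group("base")
    return slug
```

Slugs such as "milf-30" and "teen18" do not end in a letter followed by "2", so they pass through unchanged, as does any "<base>2" whose base tag is not present.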
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ad-hoc research tool: for a list of candidate tubes, fetch a listing page, grab a scene
URL, and classify the detail — reachable / JSON-LD VideoObject / duration / performers /
tags. Used 2026-06-03 to evaluate deep-crawl candidates (redtube + drtuber look strong;
pornhub/spankbang/porntrex/hqporner/youporn rejected; nuvid/motherless bare).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
We ingested only ~3% of each browse tube's catalog (porndoe >62k scenes; we had 1959)
because tubes were hit only by performer-search + top-N browse. Pilot (porndoe pages
64-110): 1119 new scenes, 100% playable + 100% tagged, 0% canonical overlap (purely
additive — content not in TPDB/StashDB).
- app/scheduler/deep_crawl.py: round-robin over ALL_BROWSE_SCRAPERS, per-tube page cursor
in app/_state/deepcrawl_state.json (no DB migration), deep-paginate from the cursor,
idempotent (resolver skips known by raw_hash), mark 'exhausted' at catalog end then
reset cursors for an incremental re-sweep.
- _job_deep_crawl: hourly, 60 pages/run (~1860 scenes, ~22 min), wrapped in the 1h
hard-timeout; registered in build_scheduler (jobs=10).
- config: sched_deep_crawl_hours=1, deep_crawl_pages_per_run=60, deepcrawl_state_path.
- scripts/pilot_porndoe_deepcrawl.py: one-off pilot used to validate the approach.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Publishing the OTA from Windows git-bash failed at the scp step (2026-06-02):
- git-bash (MSYS) rewrote the /root/... env path to 'C:/Program Files/Git/root/...'
before Python saw it → upload targeted a bogus remote dir.
- scp local source 'C:\...\dist' is parsed as host 'C' (drive letter = host).
Fixes: default runtime 1.0→1.1 (active channel, app.json runtimeVersion=1.1); scp
source passed as '.' with cwd=DIST (no drive letter); MSYS_NO_PATHCONV=1 in subprocess
env; defensive un-mangle of a git-bash-converted VPS_BASE.
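
A sketch of the two defensive measures, assuming Git for Windows prefixes converted paths with its install root ending in "/Git". The function names, the regex, and the sample path in the usage note are illustrative.

```python
import os
import re

# Assumption: the mangled value looks like "<drive>:/.../Git/<original path>".
_MSYS_PREFIX_RE = re.compile(r"^[A-Za-z]:[\\/].*?[\\/]Git(?=[\\/])", re.IGNORECASE)


def unmangle_posix_path(value):
    """Undo git-bash path conversion on a remote POSIX path."""
    stripped = _MSYS_PREFIX_RE.sub("", value, count=1)
    return stripped.replace("\\", "/")


def subprocess_env():
    """Environment for child processes with MSYS path conversion disabled."""
    env = dict(os.environ)
    env["MSYS_NO_PATHCONV"] = "1"
    return env
```

For example, `unmangle_posix_path("C:/Program Files/Git/root/example-dir")` gives `/root/example-dir`, and a path that was never converted is returned as-is.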
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Scene.duration_sec was NULL for ~74% of playable scenes (tube duration lives on
playback_source, never propagated to Scene), so the mobile min_duration_sec=60 filter
(Scene.duration_sec >= 60; NULL fails) silently hid them — surfaced as '119 in favorites,
14 after entering the performer' (Safira Yakkuza).
- resolver: _effective_duration() falls back to max live playback_source duration when the
connector provides no scene-level duration (forward fix, used in create + update).
- scripts/backfill_scene_duration_from_playback.py: one-off idempotent backfill (recovered
204,014 scenes).
- taxonomy_counts: scene_count now counts playable AND duration_sec >= 60, matching the
always-60s-filtered scene lists, so favorites/performer/studio/tag badges agree with what
the scene screen actually shows (Safira: 39 == 39).
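
The fallback can be sketched as below. The tuple shape for playback sources is an assumption made for the example, and this is not the resolver's actual `_effective_duration()`.

```python
def effective_duration(scene_duration_sec, playback_sources):
    """Prefer the scene-level duration; otherwise fall back to the longest
    live playback source.

    playback_sources: iterable of (duration_sec, dead_at) tuples.
    """
    if scene_duration_sec:
        return scene_duration_sec
    live = [d for d, dead_at in playback_sources if dead_at is None and d]
    return max(live) if live else None
```

Returning None when nothing is known keeps the NULL semantics intact, so only scenes with a real live duration start passing the `>= 60` filter.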
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- paradisehill.fetch_movies compared release_date coerced to midnight against the
`since` timestamp, so the chronological crawl stopped at the first upload dated
the same calendar day as `since` and silently dropped most new movies (0-2 seen
per run; Movies tab stalled). Compare by DATE with a 1-day grace instead; idempotent
external_records upsert dedups the re-fetched recent window.
- scripts/backfill_paradisehill_movies.py: one-off no-delta deep crawl to recover the
backlog missed during the bug (idempotent, resumable).
- docs: correct stale 'raz dziennie/24h' ("once a day/24h") browse-latest comments to 6h (4x/day), the
actual configured cadence (config.py sched_browse_latest_hours=6).
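
The paradisehill date comparison from the first item can be illustrated as follows. The function names are hypothetical; the second function reproduces the old midnight-coerced comparison only to show why same-day uploads were dropped.

```python
from datetime import date, datetime, time, timedelta


def is_within_window(release_date, since, grace_days=1):
    """Compare by calendar date, with a grace period before `since`."""
    return release_date >= since.date() - timedelta(days=grace_days)


def old_midnight_check(release_date, since):
    """The buggy variant: release_date coerced to midnight vs. a timestamp."""
    return datetime.combine(release_date, time.min) >= since


since = datetime(2026, 6, 1, 14, 30)
same_day = date(2026, 6, 1)
```

With `since` at 14:30, a movie dated the same day fails the old check (midnight is earlier than 14:30) but passes the date-based one; the re-fetched overlap is then absorbed by the idempotent upsert.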
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolver/perf:
- find_by_phash_within: nearest match via Postgres bit_count over bit(64) XOR
instead of Python scan of all phash fingerprints (~20x faster per scene;
unblocks long delta runs that were killed mid-run before since advanced).
Scheduler/reliability:
- reap ingest_runs stuck in 'running' on worker startup (killed_by_restart).
- smoke_test: per-source ingest health, stuck-run and browse-freshness checks
-> Sentry; exclude killed_by_restart from the failed-run alarm.
Tags (ingest with tags + fill blanks):
- wire infer_tag_slugs into normalize_scene so tube scenes get title-inferred
tags (was dead code); union with connector tags.
- scripts/backfill_inferred_tags.py: keyset/batched/idempotent backfill for
existing tagless scenes (playable tag coverage 16% -> ~52%).
Clip-store:
- skip ManyVids/IWantClips/Clips4Sale/... from canonical sources at ingest
(GOON_SKIP_CLIP_STORE, default on) — permanent orphans, ~56% of canonical
ingest, never have a free-tube playback source.
Browse tubes:
- enable fullmovies + hdporn.gg: studio parsed from title prefix instead of
the /networks/ sidebar (which always yielded the first listed network);
drop phash compute (pilot: 0% canonical hit within Hamming 5 — auto-screenshots),
matching relies on title/performer/duration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Opt-in remediation for the duration-inconsistent scenes found by the audit.
Scope is deliberately narrow and reversible:
- only scenes with >=3 duration-bearing sources AND max/min ratio > 3x
- anchored on scene.duration_sec (the canonical value), never the median of
sources (a median is wrong when several bogus short clips outvote the real
full-length source)
- marks dead ONLY sources that are >2x SHORTER than the canonical — a falsely
merged source is almost always a short SEO clip/preview. Sources longer than
the canonical are left alone, since an over-long outlier more often means the
canonical duration itself is too low (so killing the long source would drop
the real video); those stay for manual review.
- guards that at least one live source remains
- dry-run by default; --yes to apply; sets dead_at (reversible), not delete
First run marked 514 short-clip sources dead across 228 scenes.
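
The selection rules above can be sketched as a pure function. The name, argument shapes and return value are illustrative; the thresholds are the ones listed in the message.

```python
def sources_to_mark_dead(canonical_sec, sources, min_sources=3,
                         min_ratio=3.0, short_factor=2.0):
    """sources: list of (source_id, duration_sec) for live sources.
    Returns the ids that would be marked dead (possibly empty)."""
    known = [(sid, d) for sid, d in sources if d]
    if not canonical_sec or len(known) < min_sources:
        return []
    durations = [d for _, d in known]
    if max(durations) / min(durations) <= min_ratio:
        return []  # not inconsistent enough to act on
    # Only sources more than short_factor times SHORTER than canonical.
    dead = [sid for sid, d in known if d * short_factor < canonical_sec]
    if len(dead) == len(sources):
        return []  # guard: at least one live source must remain
    return dead
```

Anchoring on the canonical duration rather than the median means several short clips cannot outvote the one full-length source, and over-long sources are never selected.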
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Read-only data-quality audit for scene merges made before the 2026-05-12
scoring hardening (which now caps weak-signal aggregator matches at 0.85 and
tightened the duration bump to <=3s). The auto-merge candidate log does not
record which external_ref was attached, so a merge cannot be reversed from the
log alone. Instead this detects false merges by their effect: a scene that
absorbed a different video ends up with playback_sources of inconsistent
durations (e.g. a 60s clip alongside a 2h source).
Reports counts + severity buckets by max/min duration ratio, can list the worst
offenders with a per-source breakdown, and can export suspects to JSON. Mutates
nothing — remediation (detach/mark-dead the outlier source) is left as an
explicit, separately-decided step because short durations can be legitimate
(previews) and n=2 scenes are ambiguous about which source is canonical.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DIAGNOSIS ON THE EMULATOR (emulator-5554, goon-v0.1.9.apk):
Two wrong assumptions from earlier sessions, disproved empirically:
1. RUNTIME: the APK has EXPO_RUNTIME_VERSION="1.0" (NOT 0.1.9; I confused versionName
with the runtime). The app accepts ONLY a runtime 1.0 manifest. My earlier
"fix" to 0.1.9 (c19da51) was backwards and the app ignored it. Reverted: app.json
+ publish_update RUNTIME_DEFAULT back to "1.0".
2. CRASH: the real cause of "nothing shows up" was the OTA bundle with expo-font
crashing: "Cannot find native module 'ExpoFontLoader'" → expo-updates
ErrorRecovery rollback. The APK (built May 22) has no native ExpoFontLoader
(expo-font was added May 30, AFTER the APK build). An OTA CANNOT deliver a native
module. Confirmed: grep of the embedded bundle + served bundle = 0 ExpoFontLoader;
the old font bundle crashed, the font-stripped one does NOT.
FIX: removed useFonts and the expo-font import from App.tsx; theme.fonts → undefined
(system font); SceneTile/MoviePosterCard/navigation/GoonWordmark fontFamily →
fontWeight. Everything else (2-col grid, oxblood, SVG logo; RNSVG is in the APK)
stays. Custom fonts come back when the APK is rebuilt with expo-font (option B).
VERIFIED: bundle d5b87e5c (runtime 1.0, 0 ttf), emulator launch:
`ReactNativeJS: Running "main"`, zero JS errors, no ExpoFontLoader crash.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ROOT CAUSE of all the "vanishing" OTAs (2026-05-29..30, ~6 publishes into the void):
the installed APK has EXPO_RUNTIME_VERSION=0.1.9 (AndroidManifest), but app.json
had runtimeVersion "1.0" and publish_update.py defaulted to --runtime 1.0.
Updates landed in /expo-updates/1.0/, while the app, sending the header
expo-runtime-version:0.1.9, got HTTP 204 (no update) and never applied anything despite "OK live".
Fix:
- app.json runtimeVersion "1.0" -> "0.1.9" (== APK)
- publish_update.py RUNTIME_DEFAULT "1.0" -> "0.1.9"
- Republished the whole cumulative bundle under 0.1.9 (ce275235); verified:
the manifest for expo-runtime-version:0.1.9 returns 200 + runtimeVersion:0.1.9, and
the 4.76MB bundle is served with 200.
The old /expo-updates/1.0/ (~40 never-applied updates) is to be removed separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Goon — self-hosted aggregator for adult-content scene metadata.
Indexes scenes from TPDB, StashDB, and 30+ public adult tube sites.
Cross-source deduplication via perceptual hash + Levenshtein distance.
FastAPI backend + APScheduler worker + React Native (Expo) mobile client.
FOSS, ad-free, donation-funded. See README for details.