goon/app/scheduler
jtrzupek 4922646011 feat(dedup): merge exact-phash + same-duration + shared-performer duplicates
bug-report 2026-06-03 ("same duration, same thumbnail, why aren't they merging"):
duplicate scenes are not merged at ingest. Exact phash alone is noisy here: 95% of
exact matches are collisions on shared thumbnails/intro frames between different
scenes, and the bulk_dedup scorer correctly auto-merges none of them. The safe
subset is exact phash AND same duration (±3s) AND shared performer/title, which is
a near-certain same scene. Same duration is the key condition: it excludes the
false-merge pattern, since a short clip and its full scene have differing durations.
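
The safe-subset rule above can be sketched as a single predicate. This is a minimal illustration, not the code in this commit: the `Scene` shape, the field names, and the function name `is_safe_exact_dupe` are hypothetical, and "shared performer/title" is read here as "at least one shared performer OR an identical title", which is an assumption.

```python
from dataclasses import dataclass

DURATION_TOLERANCE_S = 3  # the ±3s window named in the commit message


@dataclass(frozen=True)
class Scene:
    # Hypothetical minimal shape; the real Scene model is not shown here.
    id: int
    phash: str
    duration_s: float
    title: str
    performers: frozenset = frozenset()


def is_safe_exact_dupe(a: Scene, b: Scene) -> bool:
    """Return True only for the conservative subset described above."""
    # 1. Exact perceptual-hash match (no Hamming distance tolerance).
    if a.phash != b.phash:
        return False
    # 2. Same duration within tolerance; rejects short-clip-vs-full pairs.
    if abs(a.duration_s - b.duration_s) > DURATION_TOLERANCE_S:
        return False
    # 3. Shared performer or same title (assumed reading of "performer/title").
    shared_performer = bool(a.performers & b.performers)
    same_title = a.title.strip().casefold() == b.title.strip().casefold()
    return shared_performer or same_title
```

With this sketch, a 1200s scene and a 1202s copy sharing a performer pass, while a 90s clip with the same phash and title is rejected on duration alone.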

- scripts/merge_phash_exact_dupes.py: one-off script, dry-run by default. It re-fetches
  each pair before merging, so clusters of more than two duplicates are handled.
  Applied: 30 merged.
- bulk_dedup: add `_pairs_exact_phash` (SQL, O(N log N), instead of the O(N²) Hamming
  scan) and strategy "phash_exact". The strategy is gated by the normal scorer, so it
  surfaces review candidates without risky auto-merge, and it can be scheduled for
  ongoing exact-collision review.
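
As a rough illustration of why an exact-match pair query avoids the pairwise Hamming scan: with an index on the hash column, an equality self-join lets the database find collisions directly instead of comparing every scene against every other. The sketch below uses the standard-library `sqlite3` module and a hypothetical `scene(id, phash)` table; the actual schema and the body of `_pairs_exact_phash` are not shown in this listing and may differ.

```python
import sqlite3


def pairs_exact_phash(conn: sqlite3.Connection) -> list[tuple[int, int]]:
    """Return (id_a, id_b) pairs with identical phash, each pair once."""
    rows = conn.execute(
        """
        SELECT a.id, b.id
        FROM scene AS a
        JOIN scene AS b
          ON a.phash = b.phash   -- equality join, can use the phash index
         AND a.id < b.id         -- emit each unordered pair once
        ORDER BY a.id, b.id
        """
    ).fetchall()
    return [(row[0], row[1]) for row in rows]


def demo() -> list[tuple[int, int]]:
    # In-memory database with a hypothetical minimal schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scene (id INTEGER PRIMARY KEY, phash TEXT)")
    conn.execute("CREATE INDEX ix_scene_phash ON scene (phash)")
    conn.executemany(
        "INSERT INTO scene (id, phash) VALUES (?, ?)",
        [(1, "aa"), (2, "bb"), (3, "aa"), (4, None), (5, "aa")],
    )
    try:
        return pairs_exact_phash(conn)
    finally:
        conn.close()
```

Rows with a NULL phash never match because `NULL = NULL` is not true in SQL. The pairs returned would still go through the scorer; nothing in this sketch merges anything.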

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 20:08:06 +02:00
__init__.py Initial commit 2026-05-20 10:10:22 +02:00
browse_latest.py refactor(ingest): rename scraper Source name "pornapp" -> "tube-scraper" 2026-06-07 16:54:55 +02:00
bulk_dedup.py feat(dedup): merge exact-phash + same-duration + shared-performer duplicates 2026-06-07 20:08:06 +02:00
deep_crawl.py refactor(ingest): rename scraper Source name "pornapp" -> "tube-scraper" 2026-06-07 16:54:55 +02:00
jobs.py fix(scheduler): bulk_dedup performers cross_source_only + hard-timeout (OOM) 2026-06-07 11:00:19 +02:00
performer_driven.py refactor(ingest): rename scraper Source name "pornapp" -> "tube-scraper" 2026-06-07 16:54:55 +02:00
taxonomy_counts.py fix(scenes): propagate playback duration to Scene + duration-consistent counts 2026-06-01 21:31:01 +02:00
worker.py feat(scheduler): deep-crawl full tube catalogs (Phase 2a — ingest-all) 2026-06-03 09:26:44 +02:00