goon/scripts/merge_exact_title_duration.py
jtrzupek f014a901de feat(scheduler): periodic title+duration dedup (missing-merge tube dupes)
Missing-merge duplicates (same performer + identical normalized title + identical duration-to-the-second) that bulk_dedup misses — tube re-scrapes and cross-tube re-ingests like porn00 pulling a video already present from xnxx (reports 28fe8181/32df33b1). Extracted the proven merge_exact_title_duration logic into app/scheduler/title_duration_dedup.py (script now a thin wrapper), wired a 12h scheduler job (playback-only = what users actually see, GOON_SCHED_TITLE_DEDUP_HOURS). Signal is near-certain (two different videos don't share byte-identical title AND exact duration); no shared performer = not merged (over-match guard). Verified: job registers (jobs=14), backlog currently 0 after the one-shot global merge.
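
A minimal sketch of the dedup signal described above, for illustration only. The `Video` record, the single `performer_id` per video, and the `normalize_title` rule (casefold plus whitespace collapse) are assumptions; the real schema and normalization live in app/scheduler/title_duration_dedup.py and are not shown here.

```python
from __future__ import annotations

import re
from collections import defaultdict
from typing import NamedTuple, Optional


class Video(NamedTuple):
    # Hypothetical record shape, not the project's actual model.
    id: str
    performer_id: Optional[str]
    title: str
    duration_s: int


def normalize_title(title: str) -> str:
    # Assumed normalization: collapse whitespace runs, trim, casefold.
    return re.sub(r"\s+", " ", title).strip().casefold()


def group_duplicates(videos: list[Video]) -> list[list[str]]:
    """Return groups of video ids; the first id in each group is the one kept."""
    buckets: dict[tuple[str, str, int], list[str]] = defaultdict(list)
    for v in videos:
        if v.performer_id is None:
            # Over-match guard: without a performer there is nothing to share.
            continue
        key = (v.performer_id, normalize_title(v.title), v.duration_s)
        buckets[key].append(v.id)
    # Only buckets holding more than one video are merge candidates.
    return [ids for ids in buckets.values() if len(ids) > 1]


videos = [
    Video("a1", "p1", "Hello  World", 120),
    Video("b2", "p1", "hello world", 120),   # same performer, title, duration
    Video("c3", "p1", "hello world", 121),   # duration differs by one second
    Video("d4", None, "hello world", 120),   # no performer: never merged
    Video("e5", "p2", "hello world", 120),   # different performer
]
print(group_duplicates(videos))  # [['a1', 'b2']]
```

Only `a1` and `b2` end up in a group: a one-second difference in duration, a missing performer, or a different performer each keeps a video out of the merge.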

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 11:20:48 +02:00

49 lines
1.6 KiB
Python

"""Merge missing-merge duplicates: same performer + identical normalized title
+ identical duration (to the second).

The logic lives in app/scheduler/title_duration_dedup.py (shared with the scheduler job
`_job_title_duration_dedup`). This file is a thin CLI wrapper.

Usage (worker container):
    python scripts/merge_exact_title_duration.py [PERFORMER_ID] [--commit] [--playback-only]

Without PERFORMER_ID = all performers (global). Without --commit = dry run (prints the pairs).
"""
from __future__ import annotations

import sys

from app.scheduler.title_duration_dedup import _groups, run_title_duration_dedup


def main() -> None:
    commit = "--commit" in sys.argv
    playback_only = "--playback-only" in sys.argv
    pid = None
    # A positional argument of 32+ characters is treated as the performer id;
    # shorter positional arguments are ignored, and the last match wins.
    for a in sys.argv[1:]:
        if not a.startswith("--") and len(a) >= 32:
            pid = a

    if not commit:
        # Dry run: report what would be merged without changing anything.
        groups = _groups(pid, playback_only)
        pairs = sum(len(g) - 1 for g in groups)
        print(
            f"performer={pid or 'ALL'} playback_only={playback_only} "
            f"groups={len(groups)} merges={pairs} commit=False",
            flush=True,
        )
        for g in groups:
            # g[0] is kept; every other id in the group would be dropped.
            for drop in g[1:]:
                print(f" [dry] keep {g[0][:8]} <- drop {drop[:8]}")
        print(f"DONE merged=0/{pairs} errors=0", flush=True)
        return

    res = run_title_duration_dedup(pid=pid, playback_only=playback_only, commit=True)
    print(
        f"DONE merged={res['merged']}/{res['merges']} errors={res['errors']} "
        f"groups={res['groups']}",
        flush=True,
    )


if __name__ == "__main__":
    main()