Missing-merge duplicates (same performer + identical normalized title + identical duration, to the second) are missed by bulk_dedup. They come from tube re-scrapes and cross-tube re-ingests, such as porn00 pulling a video already present from xnxx (reports 28fe8181/32df33b1).

Extracted the proven merge_exact_title_duration logic into app/scheduler/title_duration_dedup.py (the script is now a thin wrapper) and wired a 12h scheduler job (playback-only, i.e. what users actually see; GOON_SCHED_TITLE_DEDUP_HOURS).

The signal is near-certain: two different videos don't share a byte-identical title AND an exact duration. No shared performer = not merged (over-match guard).

Verified: the job registers (jobs=14), and the backlog is currently 0 after the one-shot global merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
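The duplicate signal described in the commit message can be sketched as a grouping key. This is a minimal illustration, not the code in app/scheduler/title_duration_dedup.py: the `Video` record, `normalize_title`, and `find_groups` are hypothetical names, and the normalization shown (NFKC, casefold, whitespace collapse) is an assumption, since the real normalization is not part of this commit view.

```python
from __future__ import annotations

import unicodedata
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    # Hypothetical record; the real schema is not shown in this commit.
    id: str
    performer_id: str | None
    title: str
    duration_s: int


def normalize_title(title: str) -> str:
    # Assumed normalization: Unicode NFKC, casefold, collapse whitespace.
    return " ".join(unicodedata.normalize("NFKC", title).casefold().split())


def find_groups(videos: list[Video]) -> list[list[str]]:
    """Group video ids that share performer + normalized title + exact duration."""
    buckets: dict[tuple[str, str, int], list[str]] = defaultdict(list)
    for v in videos:
        if v.performer_id is None:
            # Over-match guard: without a performer there is nothing to share.
            continue
        key = (v.performer_id, normalize_title(v.title), v.duration_s)
        buckets[key].append(v.id)
    # Only buckets with more than one video are merge candidates.
    return [ids for ids in buckets.values() if len(ids) > 1]


videos = [
    Video("a1", "p1", "Some  Title", 613),
    Video("b2", "p1", "some title", 613),
    Video("c3", "p2", "some title", 613),  # different performer
    Video("d4", "p1", "some title", 614),  # one second off
]
print(find_groups(videos))  # [['a1', 'b2']]
```

Only `a1` and `b2` land in the same bucket. A different performer or a one-second difference in duration is enough to keep two videos apart, which is what makes the signal conservative.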
49 lines
1.6 KiB
Python
"""Merge missing-merge duplicates: same performer + identical normalized title
+ identical duration (to the second).

The logic lives in app/scheduler/title_duration_dedup.py (shared with the scheduler job
`_job_title_duration_dedup`). This file is a thin CLI wrapper.

Usage (worker container):
    python scripts/merge_exact_title_duration.py [PERFORMER_ID] [--commit] [--playback-only]
Without PERFORMER_ID = all performers (global). Without --commit = dry-run (prints pairs).
"""
from __future__ import annotations

import sys

from app.scheduler.title_duration_dedup import _groups, run_title_duration_dedup


def main() -> None:
    commit = "--commit" in sys.argv
    playback_only = "--playback-only" in sys.argv
    pid = None
    # Any non-flag argument of 32+ characters is taken as PERFORMER_ID (last one wins).
    for a in sys.argv[1:]:
        if not a.startswith("--") and len(a) >= 32:
            pid = a

    if not commit:
        # Dry-run: list the groups and the merges that would happen, change nothing.
        groups = _groups(pid, playback_only)
        pairs = sum(len(g) - 1 for g in groups)
        print(
            f"performer={pid or 'ALL'} playback_only={playback_only} "
            f"groups={len(groups)} merges={pairs} commit=False",
            flush=True,
        )
        for g in groups:
            # The first id in each group is kept; the rest would be merged into it.
            for drop in g[1:]:
                print(f" [dry] keep {g[0][:8]} <- drop {drop[:8]}")
        print(f"DONE merged=0/{pairs} errors=0", flush=True)
        return

    res = run_title_duration_dedup(pid=pid, playback_only=playback_only, commit=True)
    print(
        f"DONE merged={res['merged']}/{res['merges']} errors={res['errors']} "
        f"groups={res['groups']}",
        flush=True,
    )


if __name__ == "__main__":
    main()
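The wrapper scans `sys.argv` by hand, so a mistyped flag or a short positional argument is silently ignored. As an alternative sketch (not what the script does), the same interface could be expressed with the standard-library `argparse`. The `performer_id` validator and `parse_args` helper are hypothetical names, and the 32-character minimum is an assumption copied from the length check in `main()`.

```python
from __future__ import annotations

import argparse


def performer_id(value: str) -> str:
    # Mirrors the wrapper's length check; the real ID format is an assumption.
    if len(value) < 32:
        raise argparse.ArgumentTypeError("PERFORMER_ID must be at least 32 characters")
    return value


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="merge_exact_title_duration.py")
    # Optional positional: omitted means all performers (global).
    parser.add_argument(
        "performer_id", nargs="?", default=None, type=performer_id, metavar="PERFORMER_ID"
    )
    # Without --commit the run stays a dry-run.
    parser.add_argument("--commit", action="store_true")
    parser.add_argument("--playback-only", action="store_true")
    return parser.parse_args(argv)


args = parse_args(["0123456789abcdef0123456789abcdef", "--playback-only"])
print(args.performer_id, args.commit, args.playback_only)
```

The trade-off is strictness: `argparse` exits with a usage error on unknown flags or a too-short ID, where the hand-rolled scan falls back to a global dry-run.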