goon/scripts/merge_exact_title_duration.py
jtrzupek e23e2d1f17 fix(merge): move playback_sources on scene merge + exact-title+duration dedup
merge_scenes never reassigned playback_sources → ON DELETE CASCADE dropped them
with the absorbed scene. Cross-source (canonical) merges rarely had tube playback
so it hid, but tube-dup merges silently LOST playback links. Add _move_playback_sources
(global unique (origin,page_url) guarantees no collision on reassign).

+ merge_exact_title_duration.py: catches missing-merge dupes bulk_dedup misses
(same performer + identical normalized title + identical duration_sec, no phash).
Bad Bella had 25 such pairs (bug-report ef92809d "duplicate, same thumbnails").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 10:56:50 +02:00
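
The failure mode described in the commit message can be reproduced in isolation. The sketch below is not the project's `_move_playback_sources`. It is a minimal stand-in that assumes an in-memory SQLite database (the real queries look like PostgreSQL), a simplified two-table schema, and a hypothetical helper name `move_playback_sources`. It shows that deleting the absorbed scene cascades away its playback rows unless they are reassigned first, and that a unique key on `(origin, page_url)` cannot be violated by changing `scene_id` alone.

```python
import sqlite3


def move_playback_sources(conn, keep_id, drop_id):
    """Hypothetical helper: reassign playback rows to the keeper scene.

    Returns the number of rows moved. scene_id is not part of the
    UNIQUE (origin, page_url) key, so the UPDATE cannot collide.
    """
    cur = conn.execute(
        "UPDATE playback_sources SET scene_id = ? WHERE scene_id = ?",
        (keep_id, drop_id),
    )
    return cur.rowcount


def demo(move=True):
    """Merge scene 'drop' into 'keep'; return (rows moved, rows left on keeper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for CASCADE
    # Assumed, simplified schema; only the table and column names that
    # appear in the commit message and script are taken from the source.
    conn.executescript(
        """
        CREATE TABLE scenes (id TEXT PRIMARY KEY);
        CREATE TABLE playback_sources (
            id INTEGER PRIMARY KEY,
            scene_id TEXT NOT NULL REFERENCES scenes(id) ON DELETE CASCADE,
            origin TEXT NOT NULL,
            page_url TEXT NOT NULL,
            UNIQUE (origin, page_url)
        );
        INSERT INTO scenes VALUES ('keep'), ('drop');
        INSERT INTO playback_sources (scene_id, origin, page_url) VALUES
            ('keep', 'tube-a', 'https://example.invalid/a'),
            ('drop', 'tube-b', 'https://example.invalid/b');
        """
    )
    moved = move_playback_sources(conn, "keep", "drop") if move else 0
    # Deleting the absorbed scene cascades to whatever still points at it.
    conn.execute("DELETE FROM scenes WHERE id = 'drop'")
    left = conn.execute(
        "SELECT count(*) FROM playback_sources WHERE scene_id = 'keep'"
    ).fetchone()[0]
    conn.close()
    return moved, left
```

Calling `demo(move=False)` models the pre-fix behaviour, where the absorbed scene's playback link disappears with it. `demo()` models the fixed order of operations: reassign first, then delete.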

98 lines
3.6 KiB
Python

"""Merge missed duplicates: same performer + identical normalized title
+ identical duration (to the second).

Context: bulk_dedup catches cross-source duplicates (tpdb↔stashdb) and exact-phash
matches, but it does NOT catch tube dups without fingerprints (e.g. the same scene
scraped twice under a different URL/slug). On the performer page the user then sees
"the same thumbnails, duplicate" (bug-report ef92809d: Bad Bella had 25 such pairs).
The signal `same performer + exact norm-title + exact duration_sec` is practically
certain (two different videos do not share both an identical normalized title AND
a duration equal to the second).

Keep = the scene with the most external_refs, then the most playback_sources, then
the oldest.
Merging goes through resolve.scene_merge.merge_scenes (moves refs/performers/tags/
fingerprints/playback_sources; the playback move was added 2026-06-08 together with
this script).

Usage (worker container):
    python scripts/merge_exact_title_duration.py [PERFORMER_ID] [--commit]

Without PERFORMER_ID = all performers (global). Without --commit = dry-run.
"""
from __future__ import annotations

import sys

from sqlalchemy import text

from app.db import session_scope
from app.resolve.scene_merge import merge_scenes


def _args() -> tuple[str | None, bool]:
    commit = "--commit" in sys.argv
    pid = None
    for a in sys.argv[1:]:
        # Any non-flag argument of 32+ characters is treated as the performer UUID.
        if a != "--commit" and len(a) >= 32:
            pid = a
    return pid, commit


def _groups(pid: str | None) -> list[list[str]]:
    # Groups of scenes (per performer) with identical lower(btrim(title)) + duration_sec.
    # Member order: refs DESC, srcs DESC, created_at ASC, so the first member is the keeper.
    where_perf = "AND sp.performer_id = :pid" if pid else ""
    sql = f"""
    WITH cand AS (
        SELECT s.id,
               sp.performer_id,
               lower(btrim(s.title)) nt,
               s.duration_sec dur,
               s.created_at,
               (SELECT count(*) FROM scene_external_refs r WHERE r.scene_id=s.id) refs,
               (SELECT count(*) FROM playback_sources p WHERE p.scene_id=s.id) srcs
        FROM scenes s
        JOIN scene_performers sp ON sp.scene_id=s.id {where_perf}
        WHERE s.duration_sec IS NOT NULL AND btrim(s.title) <> ''
    )
    SELECT array_agg(id::text ORDER BY refs DESC, srcs DESC, created_at ASC) members
    FROM cand
    GROUP BY performer_id, nt, dur
    HAVING count(*) > 1
    """
    params = {"pid": pid} if pid else {}
    with session_scope() as s:
        rows = s.execute(text(sql), params).all()
    # Deduplicate groups (the same set can come out for 2 performers who share scenes).
    seen: set[frozenset] = set()
    out: list[list[str]] = []
    for (members,) in rows:
        key = frozenset(members)
        if key in seen:
            continue
        seen.add(key)
        out.append(list(members))
    return out


def main() -> None:
    import uuid as _u

    pid, commit = _args()
    groups = _groups(pid)
    pairs = sum(len(g) - 1 for g in groups)
    print(f"performer={pid or 'ALL'} groups={len(groups)} merges={pairs} commit={commit}", flush=True)
    merged = 0
    for g in groups:
        keep = g[0]  # first member = keeper (see ordering in _groups)
        for drop in g[1:]:
            if not commit:
                print(f" [dry] keep {keep[:8]} <- drop {drop[:8]}")
                continue
            try:
                # One session per merge, so a failure affects only that pair.
                with session_scope() as s:
                    merge_scenes(s, keep_id=_u.UUID(keep), drop_id=_u.UUID(drop), resolved_by="merge_exact_title_duration")
                # Count the merge only after the session block has exited cleanly.
                merged += 1
            except Exception as e:
                print(f" ERR keep {keep[:8]} drop {drop[:8]}: {e}")
    print(f"DONE merged={merged}/{pairs}", flush=True)


if __name__ == "__main__":
    main()
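
The grouping and keeper-selection rule used by `_groups` can also be sketched without a database, which may help when reasoning about which scene survives a merge. The following is an illustrative pure-Python approximation, not part of the script. The function name `group_duplicates`, the dict-based row shape, and the `SAMPLE` data are all hypothetical. Each row stands for one (scene, performer) pair, mirroring the join against `scene_performers`.

```python
from collections import defaultdict
from datetime import datetime


def group_duplicates(rows):
    """Group rows by (performer, normalized title, duration); keeper comes first."""
    buckets = defaultdict(list)
    for r in rows:
        # Approximates lower(btrim(title)); btrim strips spaces by default.
        title = r["title"].strip(" ").lower()
        if r["duration_sec"] is None or title == "":
            continue
        buckets[(r["performer_id"], title, r["duration_sec"])].append(r)

    groups, seen = [], set()
    for members in buckets.values():
        if len(members) < 2:
            continue
        # refs DESC, srcs DESC, created_at ASC: most-referenced, then oldest.
        members.sort(key=lambda m: (-m["refs"], -m["srcs"], m["created_at"]))
        ids = [m["id"] for m in members]
        key = frozenset(ids)
        if key in seen:  # same scene set reached through another performer
            continue
        seen.add(key)
        groups.append(ids)
    return groups


def _row(id_, performer, title, dur, refs, srcs, created):
    return {"id": id_, "performer_id": performer, "title": title,
            "duration_sec": dur, "refs": refs, "srcs": srcs, "created_at": created}


# Hypothetical sample: scenes a and b are duplicates shared by performers p1
# and p2; scene c differs by one second and therefore stays out of the group.
SAMPLE = [
    _row("a", "p1", "Some Title ", 600, 1, 0, datetime(2024, 1, 2)),
    _row("b", "p1", "some title", 600, 2, 0, datetime(2024, 2, 1)),
    _row("c", "p1", "some title", 601, 0, 0, datetime(2024, 3, 1)),
    _row("a", "p2", "Some Title ", 600, 1, 0, datetime(2024, 1, 2)),
    _row("b", "p2", "some title", 600, 2, 0, datetime(2024, 2, 1)),
]
```

With this sample, `group_duplicates(SAMPLE)` yields a single group in which `b` is the keeper because it has more external refs, even though `a` is older. The second performer's identical group is dropped by the `frozenset` check.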