goon/app/scheduler/browse_latest.py
jtrzupek a196fcbcdb refactor(ingest): rename scraper Source name "pornapp" -> "tube-scraper"
The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp" — a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service; it is
not — these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs +
external_records history stays linked. No behavior change — external_id keying
(sitetag:url) and dedup are unaffected.

NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.

Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 16:54:55 +02:00

70 lines
2.6 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

"""Browse-latest strategy — scrap newest scenes from rich-metadata tubes.
Pattern (vs performer-driven):
- performer-driven: backfill scen dla znanych performerów (TPDB/StashDB → tube).
Wymaga już znanego performera w DB.
- browse-latest: forward-fill świeżymi scenami z tube'ów (~100 najnowszych /
tube / dzień). Łapie sceny których performer może być new dla nas — później
canonical ingest dorobi metadata.
Każdy `BaseBrowseScraper.latest_scenes(max_pages=5)` yielduje RawScene z bogatą
metadata (studio + performers + duration + tags + description). Composite fuzzy
w resolverze ma więc dobre sygnały dla canonical match (vs orphan-only tubes
typu pornditt, gdzie był sam title + krótki opis).
Schedulowane przez `jobs.py` co 6h / 4×dobę (`sched_browse_latest_hours=6`).
"""
from __future__ import annotations
import logging
import time
from dataclasses import dataclass
from app.connectors.direct_scrapers import ALL_BROWSE_SCRAPERS, SCRAPER_SOURCE_NAME
from app.models.source import SourceKind
from app.scheduler.performer_driven import _ingest_iter_into_run
log = logging.getLogger(__name__)
@dataclass
class BrowseCounters:
scrapers_run: int = 0
total_seen: int = 0
total_new: int = 0
total_updated: int = 0
total_errors: int = 0
def run_browse_latest(*, max_pages: int = 5) -> BrowseCounters:
"""Iteruj wszystkie zarejestrowane browse scrapers, scrap latest N pages każdy."""
counters = BrowseCounters()
for scraper_cls in ALL_BROWSE_SCRAPERS:
scraper = scraper_cls()
t0 = time.time()
log.info("browse-latest: %s starting (max_pages=%d)", scraper.sitetag, max_pages)
try:
c = _ingest_iter_into_run(
source_kind=SourceKind.scraper,
source_name=SCRAPER_SOURCE_NAME,
run_label=f"browse-latest:{scraper.sitetag}",
iterator_factory=lambda s=scraper, mp=max_pages: s.latest_scenes(
max_pages=mp
),
)
counters.scrapers_run += 1
counters.total_seen += c.get("seen", 0)
counters.total_new += c.get("new", 0)
counters.total_updated += c.get("updated", 0)
counters.total_errors += c.get("errors", 0)
elapsed = time.time() - t0
log.info(
"browse-latest: %s done in %.1fs counters=%s",
scraper.sitetag, elapsed, c,
)
except Exception as e:
counters.total_errors += 1
log.exception("browse-latest: %s failed: %s", scraper.sitetag, e)
log.info("browse-latest done: %s", counters)
return counters