porntrex/hqporner rejected for deep-crawl: KVS sites with no SSR metadata (77% of existing porntrex has no duration -> invisible under the app's >=60 filter). eporner instead exposes a public JSON API (api/v2/video/search) returning title + length_sec + keywords + added per video; ~100k videos, ~100/page, no per-scene detail fetch.

- BaseBrowseScraper.crawl_page(page): factored out of latest_scenes; returns None (transient fail) / [] (catalog end) / [scenes]. API subclasses override it.
- deep_crawl drives via crawl_page (supports HTML-listing AND API sources).
- EpornerApiScraper: crawl_page hits the eporner API -> RawScene with duration+tags+date+thumb+playback; registered in ALL_BROWSE_SCRAPERS.
- Pilot (2 API pages): 192 new, 100% playable + tagged + visible (>=60); the <180s trailer filter dropped 6 short clips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
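To make the API path concrete, here is a minimal sketch of turning one search-response page into scene records. It is not the EpornerApiScraper code, which is not shown here. The field names `title`, `length_sec`, `keywords` and `added` come from the message above; the top-level `videos` key, the comma-separated `keywords` format, and the helper name `parse_search_page` are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any

# Trailer cutoff taken from the commit message (<180s clips are dropped).
MIN_DURATION_SEC = 180


def parse_search_page(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Map one API page to plain scene dicts, skipping short trailers.

    Assumes the payload looks like {"videos": [{...}, ...]}; this shape is
    an assumption, not confirmed by the source.
    """
    scenes: list[dict[str, Any]] = []
    for video in payload.get("videos", []):
        duration = int(video.get("length_sec") or 0)
        if duration < MIN_DURATION_SEC:
            continue  # too short: treated as a trailer
        # keywords is assumed to be one comma-separated string
        raw_keywords = str(video.get("keywords") or "")
        tags = [t.strip() for t in raw_keywords.split(",") if t.strip()]
        scenes.append(
            {
                "title": video.get("title", ""),
                "duration_sec": duration,
                "tags": tags,
                "added": video.get("added"),
            }
        )
    return scenes
```

An empty result from such a helper would map naturally onto the `[]` (catalog end) case of `crawl_page`, while a failed HTTP call would map onto `None`.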
200 lines
8.3 KiB
Python
"""BaseBrowseScraper: latest-vids browse mode (vs search-by-performer).

Pattern: tubes such as shyfap/freshporno/porn00/fullmovies/pornxp expose rich
metadata (title, studio, performers, tags, duration, release_date, description)
on the detail page, which is enough for a canonical fuzzy match in the resolver.
Browse mode iterates the "latest" page (sorted by upload date) and fetches the
detail page for each scene.

Difference vs `BaseSearchScraper`:
- **search**: the tube searches scenes by performer name (for performer-driven
  backfill). Requires a known performer.
- **browse**: the tube lists the newest scenes (latest-vids endpoint). Requires
  no query; the point is fresh scenes independent of performer state.

Browse is complementary to search:
- search catches scenes for **known performers** (TPDB/StashDB → tube)
- browse catches **fresh scenes** whose performer may be new to us
  (a newcomer not yet in TPDB → we get her from browse → a later
  canonical TPDB ingest merges the records)

The subclass provides the HTML parsing (listing → scene URLs + detail → RawScene).
"""

from __future__ import annotations

import abc
import io
import logging
import re
from collections.abc import Iterator

import httpx

from app.connectors.base import RawFingerprint, RawPlaybackSource, RawScene
from app.connectors.direct_scrapers.base import BaseDirectTubeScraper
from app.extractors import browser_get

log = logging.getLogger(__name__)


class BaseBrowseScraper(BaseDirectTubeScraper, abc.ABC):
    """The subclass provides listing/detail parsing. Base flow:
      1. for page in 1..max_pages:
      2.   GET _listing_url(page)
      3.   extract scene URLs
      4.   for each URL:
      5.     GET scene detail page
      6.     parse → RawScene with rich metadata
      7.     yield
    """

    _timeout: float = 30.0
    """HTTP timeout per request."""

    @abc.abstractmethod
    def _listing_url(self, page: int) -> str:
        """URL of the 'latest-vids' listing page (page 1 = newest)."""

    @abc.abstractmethod
    def _extract_scene_urls(self, listing_html: str) -> list[str]:
        """List of absolute scene URLs from the listing HTML, newest first."""

    @abc.abstractmethod
    def _parse_detail(self, scene_url: str, detail_html: str) -> RawScene | None:
        """Parse the scene detail HTML → RawScene with metadata.

        Returns None when the scene is unavailable or parsing fails; the caller
        skips that URL and does not abort the whole browse."""

    def crawl_page(self, page: int) -> list[RawScene] | None:
        """Crawl ONE listing page → list of RawScene. Shared by browse_latest
        (top-N) and deep_crawl (cursor). Returns:
          None:  transient listing fetch failure (caller: stop, do NOT mark exhausted),
          []:    empty listing = end of catalog (caller: exhausted),
          [...]: the scenes from this page.

        API-based subclasses (e.g. EpornerApiScraper) override crawl_page directly
        (calling the API instead of listing→detail). HTML browse subclasses provide
        _listing_url/_extract_scene_urls/_parse_detail and use this default implementation.
        """
        url = self._listing_url(page)
        try:
            res = browser_get(url, timeout=self._timeout)
            # browser_get may return a response object or a plain string
            html = res.text if hasattr(res, "text") else res
        except Exception as e:
            log.warning("%s browse listing fetch failed (page %d): %s", self.sitetag, page, e)
            return None

        urls = self._extract_scene_urls(html)
        if not urls:
            return []

        log.info("%s browse page %d: %d scene URLs", self.sitetag, page, len(urls))
        out: list[RawScene] = []
        for scene_url in urls:
            try:
                res = browser_get(scene_url, timeout=self._timeout)
                detail_html = res.text if hasattr(res, "text") else res
            except Exception as e:
                # A single failed detail fetch skips that scene only
                log.info("%s detail fetch failed %s: %s", self.sitetag, scene_url, e)
                continue
            try:
                raw = self._parse_detail(scene_url, detail_html)
            except Exception as e:
                log.warning("%s detail parse failed %s: %s", self.sitetag, scene_url, e)
                continue
            if raw is not None:
                out.append(raw)
        return out

    def latest_scenes(self, *, max_pages: int = 5) -> Iterator[RawScene]:
        """Iterate scenes newest-first: pages 1..max_pages (browse_latest forward-fill).
        Deep-crawl uses crawl_page() with a cursor separately. Stops on None/[] (failure/end)."""
        for page in range(1, max_pages + 1):
            scenes = self.crawl_page(page)
            if not scenes:  # None (fetch fail) or [] (empty listing = end) → stop
                break
            yield from scenes

    # Stub `search()`: BaseDirectTubeScraper requires an implementation. Browse-only
    # tubes do not support performer-driven search, so we return an empty iterator.
    # Tubes that want *both* modes can override search() separately.
    def search(
        self,
        query: str,
        *,
        page: int = 1,
        limit: int | None = None,
    ) -> Iterator[RawScene]:
        return iter(())


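The three-way return contract of `crawl_page` matters most to a cursor-driven caller. The sketch below shows one way such a caller could react to `None`, `[]` and a non-empty list. The real `deep_crawl` is not part of this file, so `CrawlCursor` and `drive_crawl` are hypothetical names, and scenes are represented as plain strings to keep the example self-contained.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class CrawlCursor:
    """Hypothetical persisted crawl state (not the project's actual model)."""

    next_page: int = 1
    exhausted: bool = False
    scenes: list[str] = field(default_factory=list)


def drive_crawl(
    crawl_page: Callable[[int], list[str] | None],
    cursor: CrawlCursor,
    max_pages: int,
) -> CrawlCursor:
    """Advance the cursor by at most max_pages pages."""
    for _ in range(max_pages):
        if cursor.exhausted:
            break
        page_scenes = crawl_page(cursor.next_page)
        if page_scenes is None:
            # Transient failure: leave the cursor in place so a later run retries
            break
        if not page_scenes:
            # Empty listing: the catalog has ended
            cursor.exhausted = True
            break
        cursor.scenes.extend(page_scenes)
        cursor.next_page += 1
    return cursor
```

The design point is that `None` and `[]` must not be collapsed by a deep crawl the way `latest_scenes` collapses them with `if not scenes`: only the empty list should mark the source as exhausted.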
_META_RE_CACHE: dict[str, re.Pattern[str]] = {}


_PHASH_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
)


def compute_thumbnail_phash(thumbnail_url: str, *, referer: str | None = None, timeout: float = 15.0) -> str | None:
    """Download the thumbnail and return a 64-bit perceptual hash (16-char hex), or None.

    The format matches `SceneFingerprint.value` in the DB (TPDB/StashDB import the same
    8x8 phash). Resolver Path 3 `find_by_phash_within` matches at Hamming distance ≤5 (default).

    `imagehash`/`PIL` are imported lazily so that the browse_base module still imports
    when those libraries are unavailable (graceful degradation: phash=None → the resolver
    falls back to composite scoring, as if there were no fingerprint).
    """
    try:
        from PIL import Image
        import imagehash
    except ImportError:
        log.warning("imagehash/Pillow not installed; phash skipped")
        return None

    headers = {"User-Agent": _PHASH_UA}
    if referer:
        headers["Referer"] = referer
    try:
        with httpx.Client(timeout=timeout, follow_redirects=True) as c:
            r = c.get(thumbnail_url, headers=headers)
            if r.status_code != 200 or not r.content:
                return None
            img = Image.open(io.BytesIO(r.content))
            # phash defaults to hash_size=8 → 64-bit hash → 16 hex chars. imagehash
            # converts to greyscale (mode 'L') internally, but some webp/animated images
            # are multi-frame; convert() takes the first frame, which imagehash then
            # reduces to greyscale anyway.
            return str(imagehash.phash(img.convert("RGB")))
    except Exception as e:
        log.info("phash compute failed for %s: %s", thumbnail_url, e)
        return None


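For readers unfamiliar with what "Hamming ≤5" means for two 16-character hex hashes, here is a small standard-library sketch. It only illustrates the distance calculation; the resolver's `find_by_phash_within` is not shown in this file, and `phash_hamming` and `phash_within` are hypothetical helper names.

```python
def phash_hamming(a: str, b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    # XOR leaves a 1 bit wherever the two hashes disagree
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def phash_within(a: str, b: str, max_distance: int = 5) -> bool:
    """True when the hashes differ in at most max_distance bits."""
    return phash_hamming(a, b) <= max_distance
```

With 64-bit hashes, a threshold of 5 tolerates small differences such as re-encoding or mild resizing of a thumbnail, while unrelated images typically differ in far more bits.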
def meta_content(html: str, *, property: str | None = None, name: str | None = None) -> str | None:
    """Extract the content of <meta property=X content=Y> or <meta name=X content=Y>.

    Standard helper for scrapers that use OpenGraph / ya:ovs / etc.
    Compiled regexes are cached in a module-scope dict (the same selectors repeat).

    NB: separate patterns for `"..."` and `'...'` content quotes. Previously a single
    `[^"\']*` regex cut the title at an inner apostrophe (e.g. `<meta content="She's So Insatiable">`
    → `She`, bug-report 2026-05-20). Now we match exactly the same quote as the opening one.
    """
    key = f"prop:{property}" if property else f"name:{name}"
    if key not in _META_RE_CACHE:
        attr = "property" if property else "name"
        escaped = re.escape(property or name or "")
        # First alternative: double-quoted content (HTML standard, preferred); the
        # inner text may contain apostrophes. Second alternative: single-quoted
        # content; the inner text may contain double quotes.
        _META_RE_CACHE[key] = re.compile(
            rf'<meta[^>]+{attr}=["\']{escaped}["\'][^>]*?content="([^"]*)"'
            rf'|<meta[^>]+{attr}=["\']{escaped}["\'][^>]*?content=\'([^\']*)\'',
            re.IGNORECASE,
        )
    m = _META_RE_CACHE[key].search(html)
    if not m:
        return None
    # Exactly one of the two groups is set, depending on which alternative matched
    content = m.group(1) if m.group(1) is not None else m.group(2)
    return content.strip() if content else None
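The quote-handling note in the `meta_content` docstring can be reproduced with a short standalone sketch. The `NAIVE` pattern below is a reconstruction of the kind of single character class the docstring describes, not the original code; `QUOTE_AWARE` mirrors the two-alternative shape used above, hard-coded to `og:title` for brevity.

```python
import re

HTML = '<meta property="og:title" content="She\'s So Insatiable">'

# Assumed shape of the earlier pattern: one class excluding both quote types
NAIVE = re.compile(
    r'<meta[^>]+property=["\']og:title["\'][^>]*?content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)

# One alternative per quote style, so the closing quote must match the opening one
QUOTE_AWARE = re.compile(
    r'<meta[^>]+property=["\']og:title["\'][^>]*?content="([^"]*)"'
    r"|<meta[^>]+property=[\"']og:title[\"'][^>]*?content='([^']*)'",
    re.IGNORECASE,
)


def first_group(match: re.Match[str]) -> str:
    """Return whichever capture group participated in the match."""
    return next(g for g in match.groups() if g is not None)


naive_title = first_group(NAIVE.search(HTML))        # stops at the apostrophe
aware_title = first_group(QUOTE_AWARE.search(HTML))  # keeps the full title
```

Here `naive_title` is `She` and `aware_title` is `She's So Insatiable`, which is the truncation the docstring's bug report refers to.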