Commit graph

23 commits

Author SHA1 Message Date
jtrzupek
1ca503b7be feat(ingest): add xnxx browse scraper (JSON-LD only, alongside search)
Browse over /best/<YYYY-MM>/<page> (SSR; xnxx has no clean /new/ and its homepage is
JS-rendered) for a latest-feed freshness signal next to the performer-driven search
scraper. JSON-LD VideoObject only — xnxx detail (unlike its xvideos twin) doesn't
expose /models/ or /tags/ in SSR, so performers/tags come via canonical merge + the
search scraper. Title is html.unescaped (JSON-LD ships &comma;/&excl; entities).

xhamster and sxyprn intentionally left search-only: xhamster Cloudflare-blocks the
VPS on listing pages (1KB challenge), sxyprn has no clean SSR listing (IP-bound) —
a flaky browse scraper would be worse than the working search + 168h watchdog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 15:52:32 +02:00
jtrzupek
2051fc1ded feat(ingest): add youporn browse scraper (JSON-LD only, alongside search)
Browse over /browse/time/?page=<n> (SSR) for guaranteed latest-feed freshness next to
the existing performer-driven search scraper. JSON-LD VideoObject only (title /
duration / uploadDate / thumbnail) — deliberately NOT scraping performers/tags from
the detail page: JSON-LD has no actor field and the /pornstar//category links are
sidebar-polluted with no scene-scoped container, so a naive regex attached the same
2 pornstars to every scene. Performers/tags come via canonical merge + the search
scraper instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 15:47:58 +02:00
jtrzupek
55612e262b feat(ingest): add browse scrapers for porntrex + mypornerleak (alongside search)
Both were search-only — fresh only as long as the performer queue cycles and the
site search keeps working. Added browse scrapers next to the existing search ones
(xvideos/eporner pattern: search keeps performer back-catalog coverage, browse
guarantees latest-feed freshness → watchdog 48h instead of 168h):
- porntrex: KVS /latest-updates/<n>/ (title + thumb + phash)
- mypornerleak: WP REST /wp-json/wp/v2/posts?_embed=1 (title + date + studio from
  category + performers from the actors taxonomy)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 15:41:22 +02:00
jtrzupek
a10c51aebf feat(ingest): revive porndish — search→WP REST API browse
Watchdog flagged porndish as frozen (search ?s= stopped yielding new scenes
2026-05-07, 1151h). It's WordPress and the VPS can reach it, so converted to a browse
scraper over the WP REST API (/wp-json/wp/v2/posts?_embed=1), same pattern as
perverzija: title, date, featured thumbnail, studio (category — FreeUseFantasy /
I Have A Wife / … paysite content) and tags. Performers via canonical merge. Playback
unchanged (embed iframe → phone-side). 60 fresh scenes on first crawl.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 15:09:27 +02:00
jtrzupek
b3ecf7141a feat(ingest): revive perverzija — search→WP REST API browse
Search (?s=) started returning 429 and the homepage is JS-rendered (no post links in
raw HTML), so the old search scraper got 0 (frozen since 2026-05-07). perverzija is
WordPress and the VPS can reach it (200, not CF-blocked), so converted to a browse
scraper over the WP REST API (/wp-json/wp/v2/posts?_embed=1): one structured call per
page gives title, date, featured thumbnail, studio (category — DadCrush/FamilyStrokes/
… TeamSkeet-family paysite re-ups) and genre tags. Performers via canonical merge
(stars taxonomy isn't REST-exposed; title carries names). Playback unchanged (embed
iframe → phone-side). 15 fresh + 45 refreshed on first crawl.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 13:10:16 +02:00
jtrzupek
cbb2390a2a feat(sources): remove 0dayxx + pornditt + pornhat entirely
Three orphan-factory tubes (0–0.2% canonical match — auto-screenshot thumbs and
slug titles that never match TPDB/StashDB) — to be replaced by better sources.
Removed scrapers (files + imports), extractors (registry + modules), the pornhat
entry from tag-enrichment priority lists and the 0dayxx display override, and purged
the DB (19,003 playback_sources + 9,904 solo-orphan scenes; shared mirror scenes keep
their other sources). The pornhat-based enrich_studio endpoint stays as a graceful
no-op (no pornhat sources → returns no studio).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 12:23:29 +02:00
jtrzupek
2f3e57c0ac feat(ingest): revive fpoxxx — search→browse (KVS /new-N/)
fpo.xxx is a KVS site, not WordPress, so the old `?s=` search scraper matched
nothing (frozen since 2026-05-07). Converted to a browse scraper reading /new-<n>/
(title + duration + thumbnail + phash from the listing tile; performers via canonical
merge). Playback was already phone-side (KVS). 32 fresh scenes on first crawl.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 12:04:05 +02:00
jtrzupek
90e391e255 feat(sources): remove pornhub + redtube entirely
Both scrapers were disabled since 2026-05-12 (~0.4% canonical match — mostly short
amateur clips that never match studio content); their data sat frozen. Removed for
good: deleted the extractor registry entries, scraper files and imports, dropped them
from the tag-enrichment priority lists, and purged the DB (17,906 playback_sources +
122 scenes that had no other source; mirror scenes shared with other tubes just lost
the ph/rt link).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 11:55:08 +02:00
jtrzupek
f34a75f4c6 feat(ingest): disable hqfap/4k69 (broken playback), latestpornvideo → browse
- hqfap + 4k69: both ingested fresh but playback is dead (hqfap serves a fixed
  ~3MB "server down" stub for every scene; 4k69 resolves no playable URL).
  Removed from ALL_BROWSE_SCRAPERS so no new dead sources get ingested; existing
  live playback_sources marked dead in prod (scenes drop out of has_playback /
  Sites). Extractors kept in registry for easy re-enable if the hosts recover.
- latestpornvideo: was a performer-search scraper, so it never picked up the
  site's "latest" feed — users saw a stale set. Converted to a browse scraper
  reading /page/N/ (studio+date from title/thumb, category tags; performers via
  canonical merge). Moved DIRECT → BROWSE list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 09:34:47 +02:00
jtrzupek
ac84da92a4 feat(siska): convert to browse scraper, re-enable (search broken site-side)
siska's ?s= search ignores the query (returns latest regardless), so the performer-driven search scraper always yielded 0 and was disabled. Rewrote SiskaScraper as a latest-browse scraper (BaseBrowseScraper, /page/<n>/) and moved it to ALL_BROWSE_SCRAPERS. The listing tile carries everything (no detail fetch): title, duration (MM:SS span), thumbnail (img data-src), performer + studio (img alt 'Performer - Title - Studio'), category (thumbnail path). Playback unchanged: fresh videos embed playmogo + luluvid, resolved phone-side via _embed_iframe. Verified ingest: 26 seen / 11 new / 15 updated / 0 errors — and 15 updated means siska scenes match existing canonical scenes, adding playback coverage rather than orphans. Now covered by the browse ingest-watchdog (48h) and the 6h browse-latest + deep-crawl jobs. Old self-player videos (player.siska.video -> cfglobalcdn, ~2018) are dead and age out.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 16:25:11 +02:00
jtrzupek
0b6f663528 investigate(siska): keep disabled — site search is broken (ignores query)
Revisited siska re-enable (user fa4083a2). Findings: (1) fresh siska videos (videoID 227xxx) embed playmogo + luluvid and ARE phone-resolvable; updated siska.py scene regex + extractor path to the current video.php?videoID= format (old /<slug>/ format is gone). (2) BUT siska's ?s=<query> search is broken site-side — it returns the latest videos regardless of query (angela white == riley reid == homepage), so as a performer-driven BaseSearchScraper it always yields 0 (title token filter rejects everything). Reviving siska would require converting it to a browse/latest scraper (changes ingest character) — left as a decision. Old self-player videos (player.siska.video -> cfglobalcdn) are dead. Scraper stays disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 16:15:02 +02:00
jtrzupek
956a0feb22 docs: correct Bright Data proxy type (ISP, flat-rate not per-GB)
It is an ISP proxy (static ISP IPs, flat billing), not residential —
so HTML-ingest bandwidth is free and the full deep-crawl is fine.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:18:40 +02:00
jtrzupek
21bc8bf1fe feat(superporn): browse scraper via Bright Data residential proxy
superporn hard-blocks the VPS IP with Cloudflare 403 on every TLS
impersonation, so HTML ingest routes through Bright Data residential
(BRIGHTDATA_PROXY_URL, parsed in config). First scraper to use a proxy:
optional _proxy on the browse base, threaded into browser_get.

JSON-LD VideoObject (title/desc/uploadDate/thumb/duration) + pornstar
and category chips; superporn double-encodes HTML entities so titles
are unescaped twice. Thumbnails fetch fine from the VPS (no proxy).

Playback stays off-proxy: the <source> mp4 token is IP-bound to the
fetcher, so resolve is phone-side via WebView (extractor superporncom
-> _vps_blocked_fallback), same as porndoe.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:47:45 +02:00
jtrzupek
80fd83cb4e feat(tubes): add 4k69 + neporn browse scrapers, shared PlayTube base
4k69.com (~65k scenes): same PlayTube CMS as hqfap - common logic moved
to _playtube.py (sitemap catalog, JSON-LD, pills). Studio classified by
matching category pills against the studios index page. Streams are
get_file (fullmovies family) returned unresolved with mobile_direct,
2160p skipped.

neporn.com: KVS engine, latest-updates listing, JSON-LD + video:duration
meta, performers from models links with flashvars video_tags fallback
for fresh uploads. Resolve via _kvs; final URL portable cross-IP.

superporn.com rejected: Cloudflare 403 from VPS on all TLS impersonations.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:15:13 +02:00
jtrzupek
6de986b9a7 feat(hqfap): browse scraper + native mp4 extractor (~120k scenes)
PlayTube CMS. Sitemap-based pagination (listing has no GET paging),
JSON-LD VideoObject metadata, pornstar/category pills, " Clips"
categories mapped to studio. Direct mp4 (cdnde.com/okcdn.ru), tokens
time-bound and portable cross-IP, so mobile plays direct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:51:04 +02:00
jtrzupek
a196fcbcdb refactor(ingest): rename scraper Source name "pornapp" -> "tube-scraper"
The umbrella Source.name for all direct tube scrapers (deep-crawl, browse-latest,
performer-driven) was "pornapp" — a misleading leftover from the removed external
porn-app API. It read like a dependency on a third-party "pornapp" service; it is
not — these are our own scrapers hitting 25+ tubes directly (kind=scraper,
origin tube:<sitetag>). Renamed to "tube-scraper" via a single SCRAPER_SOURCE_NAME
constant; DB row renamed in place (UPDATE name, same id) so all ingest_runs +
external_records history stays linked. No behavior change — external_id keying
(sitetag:url) and dedup are unaffected.

NOTE: playback_sources.origin "pornapp:<sitetag>" prefix is a separate legacy
format (resolve_playback parses it) and is intentionally left untouched.

Verified on prod: row renamed (0 stray "pornapp"), new runs land on "tube-scraper".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 16:54:55 +02:00
jtrzupek
e42217773f feat(deep-crawl): xvideos browse source (capped) + per-tube page cap
xvideos SSR's JSON-LD VideoObject (duration/title/uploadDate) + on-page /models/ (perf)
+ /tags/. Sample: median ~10.5min, 93% >=3min. Pilot (2 pages): 29 new, 100% playable +
visible + tagged (performers sparse — xvideos 'new' is amateur-heavy; /models/ tagged
mostly on studio rips).

- XVideosBrowseScraper (JSON-LD + page-parse models/tags), in ALL_BROWSE_SCRAPERS.
- deep_crawl._PAGE_CAP: per-sitetag depth cap; xvideoscom=1800 (~newest 50k). At the cap
  the tube is marked exhausted (reset -> incremental re-sweep) so a mega-tube cannot
  monopolize the round-robin or balloon the DB.
- ported yesporn.py into the public repo (was prod-only, like hdporngg) ending the
  __init__ public/prod divergence.

youporn rejected: JSON-LD lacks actor/keywords, its /pornstar//category/ links are A-Z
nav not scene-specific. xhamster: 429/Cloudflare from the VPS IP.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 11:16:44 +02:00
jtrzupek
ee4915770f feat(deep-crawl): eporner via JSON API as SSR-rich source (Phase 2b alternative)
porntrex/hqporner rejected for deep-crawl: KVS sites with no SSR metadata (77% of
existing porntrex has no duration -> invisible under the app's >=60 filter). eporner
instead exposes a public JSON API (api/v2/video/search) returning title + length_sec
+ keywords + added per video; ~100k videos, ~100/page, no per-scene detail fetch.

- BaseBrowseScraper.crawl_page(page): factored out of latest_scenes; returns None
  (transient fail) / [] (catalog end) / [scenes]. API subclasses override it.
- deep_crawl drives via crawl_page (supports HTML-listing AND API sources).
- EpornerApiScraper: crawl_page hits the eporner API -> RawScene with duration+tags+
  date+thumb+playback; registered in ALL_BROWSE_SCRAPERS.
- Pilot (2 API pages): 192 new, 100% playable + tagged + visible (>=60); the <180s
  trailer filter dropped 6 short clips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:37:20 +02:00
jtrzupek
cd12348782 fix(movies): paradisehill delta date-granularity + browse cadence docs
- paradisehill.fetch_movies compared release_date coerced to midnight against the
  `since` timestamp, so the chronological crawl stopped at the first upload dated
  the same calendar day as `since` and silently dropped most new movies (0-2 seen
  per run; Movies tab stalled). Compare by DATE with a 1-day grace instead; idempotent
  external_records upsert dedups the re-fetched recent window.
- scripts/backfill_paradisehill_movies.py: one-off no-delta deep crawl to recover the
  backlog missed during the bug (idempotent, resumable).
- docs: correct stale 'raz dziennie/24h' browse-latest comments to 6h (4x/day), the
  actual configured cadence (config.py sched_browse_latest_hours=6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 17:00:10 +02:00
jtrzupek
da7fcda132 feat(ingest): SQL phash match, tag inference + backfill, clip-store skip, browse tubes, watchdog
Resolver/perf:
- find_by_phash_within: nearest match via Postgres bit_count over bit(64) XOR
  instead of Python scan of all phash fingerprints (~20x faster per scene;
  unblocks long delta runs that were killed mid-run before since advanced).

Scheduler/reliability:
- reap ingest_runs stuck in 'running' on worker startup (killed_by_restart).
- smoke_test: per-source ingest health, stuck-run and browse-freshness checks
  -> Sentry; exclude killed_by_restart from the failed-run alarm.

Tags (ingest with tags + fill blanks):
- wire infer_tag_slugs into normalize_scene so tube scenes get title-inferred
  tags (was dead code); union with connector tags.
- scripts/backfill_inferred_tags.py: keyset/batched/idempotent backfill for
  existing tagless scenes (playable tag coverage 16% -> ~52%).

Clip-store:
- skip ManyVids/IWantClips/Clips4Sale/... from canonical sources at ingest
  (GOON_SKIP_CLIP_STORE, default on) — permanent orphans, ~56% of canonical
  ingest, never have a free-tube playback source.

Browse tubes:
- enable fullmovies + hdporn.gg: studio parsed from title prefix instead of
  the /networks/ sidebar (which always yielded the first listed network);
  drop phash compute (pilot: 0% canonical hit within Hamming 5 — auto-screenshots),
  matching relies on title/performer/duration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:07:35 +02:00
https://github.com/goon-foss/goon
2fad46f934 filemoon: resurrect via mobile-side resolver (Byse SPA RE)
filemoon (+ mirrory kerapoxy/lvturbo/emturbovid/bysezoxexe/bysezejataos)
nie umarł — ~2026-05 zrobił rebrand na Vite SPA "Byse Frontend". Stary
P.A.C.K.E.R.-JWPlayer embed zniknął, więc backend uznał go za martwego i
wpisał na DEAD_HOSTER_RE. RE bundla index-ChwZgmXV.js (2026-05-22):

  POST /api/videos/<code>/embed/playback  body {"fingerprint":{}}
  → {"playback":{"key_parts":[..],"iv":..,"payload":..}}
  → key=concat(b64url(key_parts)); AES-256-GCM(key,iv,payload) → JSON
  → sources[*].url = HLS master.m3u8

Browser-attestation jest opcjonalny — pusty fingerprint wystarcza.
Stream URL jest IP-bound (token wiąże się z IP requestera), więc resolve
musi iść z urządzenia użytkownika (jak doodstream.ts / packerHoster.ts).

- mobile/src/lib/aesGcm.ts — pure-JS AES-256-GCM decrypt (RN/Hermes nie
  ma Web Crypto); S-box liczony z GF(2^8), GHASH weryfikuje tag.
  Zweryfikowane przeciw cryptography (Python) na 2 payloadach.
- mobile/src/lib/filemoonHoster.ts — resolver: POST playback → decrypt →
  pick best source. E2E test: filemoon.to/e + /d + bysezoxexe.com mirror.
- PlayerScreen: filemoon w resolve useEffect obok doodstream/packer.
- backend: filemoon poza DEAD_HOSTER_RE; hoster.py early-return → przelot
  jako type='hoster' do mobile resolvera (server-side resolve bezcelowy,
  bo URL IP-bound do VPS).
- direct_scrapers: poprawiony błędny komentarz "filemoon shutdown".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 13:18:26 +02:00
https://github.com/goon-foss/goon
642f1ab8b8 Mobile 0.1.9: OTA enable, WebView cookie-dismiss fix, porndoe connector
Mobile / OTA:
- Enable Expo Updates (app.json + AndroidManifest) → api.goon-foss.org
- Bump 0.1.6 → 0.1.9 (build.gradle, app.json, appVersion.ts, main.py /version)
- backend.ts: default public backend auto-connect (no manual login)

WebView fallback fix (PlayerScreen INJECTED_JS):
- Auto-dismiss cookie/consent gates (hqporner et al. blocked kt_player init)
- Context-scoped: only clicks consent buttons inside cookie/gdpr containers
- Retry window for <source>.src polling raised 5→15 ticks (post-dismiss init)

Resolver:
- Series-position + modifier mismatch detector (Episode 2≠4, BTS/unedited)
  → composite_score hard-reject / cap; wired into scene_score + bulk_dedup
- aggregator-mode candidate query: LIMIT 500 + title-match ordering

Connectors:
- porndoe.com browse scraper (JSON-LD VideoObject) — theporndude audit pilot

landing: APK links → goon-v0.1.9.apk

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 11:20:57 +02:00
goon-foss
ad0284585b Initial commit
Goon — self-hosted aggregator for adult-content scene metadata.

Indexes scenes from TPDB, StashDB, and 30+ public adult tube sites.
Cross-source deduplication via perceptual hash + Levenshtein distance.
FastAPI backend + APScheduler worker + React Native (Expo) mobile client.

FOSS, ad-free, donation-funded. See README for details.
2026-05-20 10:10:22 +02:00