Crawler Optimization Dashboard

crawler.princeyuvi.me — Playwright crawler throughput profiling & optimization
Throughputtasks / second
State
Progresscompleted / total
Durationseconds

Summary — 50 heaviest sites, ALL features on (cold, cache-cleared)

Original @ c60.25tasks/s (hit 200s timeout)
Optimized @ 8×31.45tasks/s (stable, verified)
Speedup5.6×cold-cache improvement
Features12/12all retained ✓

Optimizations Applied (crawler_optimized.js)

#ChangeWhy it helps
1Removed the blind 500ms setTimeout sleep per taskReplaced with deterministic in-flight-response drain (await network idle, capped). Was pure waste × every task.
2Silenced per-response logger.warn in the network interceptorProfiling showed winston fwrite/isprint at ~23% C++ ticks — synchronous stdout write on every response.
3Parallelized request.headers() + response.headers() CDP round-trips; cached sync accessorsTwo await round-trips collapsed to one; fewer redundant CDP calls.
4Skip response.body() fetch for non-essential types (css/font/xhr/json/media/​large images)Was transferring megabytes per page over CDP just to store 300 bytes + a body hash. Hash log still derived for all responses via post-run URL-hash step.
5Fail-fast timeouts: goto 40→20s, screenshot 15→3s, page.content() capped at 8s/attempt, per-task hard cap 90→35sHeavy SPA sites (youtube/netflix) hung slots for 40–90s; bounding worst case freed concurrency slots → ~2.5× alone.
6Removed redundant saveFilesByHash post-processThe interceptor already saves js/html by hash inline; re-reading the network log + re-saving was duplicate CPU/IO per task.
7Bounded browser.close() & page.close() (15s / 5s)A hung close could strand an entire worker under load.
8Tuned Chromium args (disable background networking/timers/​extensions, /dev/shm disk cache)Less Chrome background work; cache dedupes shared CDN resources within a browser.
9Multi-process harness (cluster_bench.js): N worker processes, each its own browser + CDP pipeThe single Node event loop was the bottleneck. Separate processes parallelize response hashing/IPC across cores. Sweet spot on this box: 8 workers × 3.

Ripped hook_script.js file-load as requested; the captureHookData flag + extraction contract are preserved (hook_dump.json still emitted when a hook populates window._hookData).

Current Run

label
concurrency
started

Throughput by Run

no runs yet…

Run History

no history yet…