How to Scrape Google Search Results with Scrapeless Scraping Browser: Organic Results, PAA, Knowledge Panels, and AI Overview

Alex Johnson

Senior Web Scraping Engineer

27-Apr-2026

Key Takeaways:

  • One CLI, every Google surface. The scrapeless-scraping-browser workflow scrapes Google organic results, Featured Snippets, People Also Ask, Knowledge Panels, Related Searches, and AI Overview — all from the same session pattern. End-to-end verified on Ubuntu, 2026-04-24 (138 organic containers per page on the scrapeless query).
  • Google's wait strategy. wait --load networkidle returns in ~14 s because Google trackers never settle; use a fixed wait 5000 instead, and verify readiness by counting div[data-ved][data-hveid] containers (≥ 8 means the SERP rendered).
  • Discover → extract with union selectors. Google rotates the snippet container across A/B variants (div.VwiC3b, div[data-content-feature="1"], .lEBKkf, span.st). Query each field independently and zip by card — a strict "must have h3 AND anchor AND snippet" rule loses 4/10 results to ads and video cards.
  • Featured Snippet moved out of .kno-rdesc. On height/measurement queries Google now renders the answer as a per-attribute Knowledge Panel fragment (span.T286Pc). The resilient pattern is a selector cascade followed by a body-text regex fallback — shown in Step 4.

Google Search is the load-bearing surface for SEO rank tracking, competitive intelligence, brand SOV, AI-grounding pipelines, and LLM eval datasets. Scraping it reliably in 2026 means handling four moving parts: residential-IP routing past the /sorry/index rate limit, a fixed-wait strategy because trackers never go fully idle, union selectors against rotating A/B class names, and a feature-by-feature extraction pattern (organic, Featured Snippet, People Also Ask, Knowledge Panel, Related Searches, AI Overview).

Google's /sorry/index CAPTCHA wall fires at variable rates — anywhere from ~1-in-10 on a calm day to nearly every US-egress request during heavy load — depending on the proxy pool, time of day, and target geo. DE/GB/JP/FR/CA proxies generally have a higher pass rate than US for Google specifically (see Step 1). The Scrapeless Scraping Browser handles residential proxies, anti-detection fingerprinting, and JavaScript rendering as session-level concerns, so the pipeline code focuses on selectors and waits.

This post is a CLI-first, verification-grounded walkthrough of the scrapeless-scraping-browser cloud browser. Every selector, wait threshold, and failure pattern below is backed by an Ubuntu verification run on 2026-04-24 — Google-specific claims for organic extraction, pagination, localization, classic-SERP suppression, AI Overview polling, Knowledge Panel, PAA, and Related Searches.


What You Can Do With It

  • SERP rank tracking on Google. Track positions for a keyword set, build a per-domain visibility score, and pin top-N results per query per timestamp.
  • Competitive keyword intelligence. Pull the top 10 for a competitor's target queries, diff host lists, and identify SERP wins your own SEO isn't capturing.
  • AI answer grounding. Harvest Google's AI Overview citations, Featured Snippet attribution, and People Also Ask pairs to build the exact evidence set that LLM-powered search tools surface to end users.
  • Knowledge Panel extraction. Pull entity fact sheets via Google's data-attrid attribute map — 20+ structured fields per entity for the Albert Einstein verification query — suitable for feeding a knowledge-graph or entity-enrichment pipeline.
  • Multi-locale monitoring. Query the same keyword from hl=de&gl=de, hl=en&gl=us, and hl=ja&gl=jp through geo-matched residential proxies to capture market-by-market SERP differences.
  • LLM eval datasets. Build deterministic ground-truth datasets for evaluating retrieval-augmented generation systems by pinning the top-N per query per timestamp.
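
The per-domain visibility score mentioned above reduces to a few lines of shell. This is a minimal sketch, not part of the CLI: the tab-separated position/URL input format and the 11 − position weighting are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Minimal per-domain visibility score (illustrative, not a CLI feature).
# Input: one "position<TAB>url" row per organic result.
# Weighting: rank 1 = 10 points, rank 10 = 1 point (an assumed scheme).
visibility_score() {
  awk -F'\t' '
    {
      host = $2
      sub(/^https?:\/\//, "", host)   # strip scheme
      sub(/\/.*$/, "", host)          # strip path, keep the host
      score[host] += (11 - $1)
    }
    END { for (h in score) printf "%s\t%d\n", h, score[h] }
  ' "$1" | sort -t$'\t' -k2,2nr
}

printf '1\thttps://www.apple.com/airpods/\n3\thttps://www.apple.com/shop/\n2\thttps://www.amazon.com/dp/X\n' > ranks.tsv
visibility_score ranks.tsv   # www.apple.com: 10 + 8 = 18; www.amazon.com: 9
```

Feed it the `{position, url}` pairs from any of the extraction steps below; hosts are counted per subdomain, which is usually what rank-trackers want.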

Why Scrapeless Scraping Browser

Scrapeless Scraping Browser is a customizable, anti-detection cloud browser designed for web crawlers and AI Agents. For Google Search specifically, it brings:

  • Residential proxies in 195+ countries (--proxy-country, --proxy-state, --proxy-city) — datacenter IP ranges are filtered aggressively by Google's edge; residential egress is the load-bearing primitive for sustained scraping.
  • Integrated CAPTCHA Solver.
  • Anti-detection fingerprinting on every session — Google's SearchGuard client-side checks treat the browser as real Chrome.
  • JavaScript rendering in the cloud — Google's SERP is hydrated; static HTML isn't enough.
  • Per-session locale alignment via --timezone and --languages — automatic with the proxy geography.

Get your API key on the free plan at scrapeless.com. Related Scrapeless products: Universal Scraping API, Proxy Solutions, and the Scrapeless MCP Server for Model Context Protocol integrations.

If a structured-JSON API fits your pipeline better than a browser, see the companion Google Search Scraper API guide.


Prerequisites

  • Node.js 18 or newer.
  • A Scrapeless account and API key — sign up at scrapeless.com.
  • jq for JSON parsing (recommended).
  • Basic familiarity with the terminal.

Install

The recipes below run on the scrapeless-scraping-browser CLI. Setup is three steps plus a smoke test (#4) — both CLI users and AI-agent users need #1 and #2; AI-agent users do #3 too.

1. Install the CLI package

bash
npm install -g scrapeless-scraping-browser

This provides the scrapeless-scraping-browser binary that every step of this post calls. The skill does not bring its own runtime — it loads command patterns into your AI agent, but the CLI itself must be installed first.

2. Configure your API key

Get your token from scrapeless.com, then store it where the CLI can read it:

bash
scrapeless-scraping-browser config set apiKey your_api_token_here
scrapeless-scraping-browser config get apiKey   # verify

Using an AI agent? The skill's instructions explicitly tell your agent that authentication is required before any session call. If the API key isn't set when the agent first tries to use the CLI, the agent will prompt you and run the config set apiKey … command for you — you can either set it manually now (commands above) or paste your token when the agent asks.

The config file lives at ~/.scrapeless/config.json with access restricted to the current user, takes priority over the environment variable, and is portable across agents and CI runners. For CI pipelines, prefer:

bash
export SCRAPELESS_API_KEY=your_api_token_here

3. Install the Scrapeless skill in your AI agent

This is a separate step from step 1 above. Step 1 installed the CLI binary — the runtime your agent invokes. The skill is what teaches your agent how to invoke it correctly (selectors, waits, retry patterns, the discover→extract workflow). They're two different things, and you need both.

The skill is a folder containing SKILL.md + skill.json + references/. The canonical source is the scrapeless-ai/scrapeless-agent-browser → skills/scraping-browser-skill repo on GitHub.

To install it in Claude Code, Cursor, VS Code + GitHub Copilot, OpenAI Codex CLI, or Gemini CLI, follow the Scrapeless AI Agent install guide — it has the per-agent copy-paste commands (bash and Windows PowerShell). Reload your agent after install so the skill becomes active.

Without the skill installed, your agent doesn't know the discover→extract pattern, the per-engine waits, or the selectors that actually work in 2026, and you'd have to spoon-feed every detail in every prompt.

What the skill loads into your agent's operating context up-front:

  • Authentication — check for ~/.scrapeless/config.json or SCRAPELESS_API_KEY and prompt you to set it if missing (see step 2).
  • Discover → Extract workflow — the anti-fragility pattern. The agent reads the live DOM with get html "<region>" first, identifies stable anchors (data-ved, data-attrid, aria-label, role, semantic ids), then writes eval selectors based on what's actually rendered — instead of guessing utility class names that Google rotates across A/B variants every few weeks.
  • Selector syntax — CSS (div[data-ved][data-hveid]) vs accessibility refs (@e1 from snapshot -i).
  • Google wait strategy — wait 5000 plus a div[data-ved][data-hveid] count check as the readiness signal. The agent picks this default rather than trusting wait --load networkidle, which never settles on Google.
  • Parallel CLI workers — single-shell && chaining, unique session names, ≤3 concurrent workers per host. --session-id alone is not sufficient under daemon contention.
  • Common pitfalls — eval returns JSON-quoted values, open exits non-zero on successful navigation, the wait --load networkidle race on cold sessions, sessions terminating when the connection closes.
  • Full command reference — every flag for new-session, open, wait, eval, get, click, fill, snapshot, auth, profile, recording, stop, etc.

4. Verify the skill is wired up

Before your first real Google scrape, smoke-test the install with one safe prompt:

"Using the Scrapeless skill, open https://example.com and tell me the page title."

Your agent should mint a session, open the page, and reply with "Example Domain". If that works, you're ready to scrape Google.

If it fails:

Symptom Likely cause Fix
"I don't have a tool/skill to do that" Skill not loaded in this agent session Reinstall via the skill install guide and reload the agent
Authentication failed / 401 API key not set Re-run scrapeless-scraping-browser config set apiKey <token> (Install step 2)
command not found CLI binary missing on PATH Re-run Install step 1 (npm install -g scrapeless-scraping-browser)
Lands on /sorry/index (Google CAPTCHA wall) US proxy under load Ask the agent to retry on a DE/GB/JP/FR/CA proxy — the skill knows to rotate
Hangs / lands on chrome://new-tab-page/ Cold-session wait race Agent should retry — wait 1500 between open and wait --load networkidle is in the skill's playbook

How you actually use this: prompt your agent

After install, you scrape Google by talking to your agent — not by copy-pasting bash. The skill loads union selectors, the Google-tuned wait strategy, and the discover→extract pattern into the agent's context, so a one-line prompt is enough to get clean SERP JSON back.

Prompts you can paste

You say to your agent What you get back
"Scrape the top 10 Google results for 'best running shoes'" JSON list, organic only, fields {position, title, url, displayedUrl, snippet}
"Scrape Google for 'airpods' across pages 1–3, deduped, save as airpods-serp.json" Single JSON file, 3 SERP pages merged + deduped by URL
"What's Google's featured snippet for 'how tall is the eiffel tower'?" The answer text, with selector cascade + regex body fallback
"Extract the Knowledge Panel for Albert Einstein" JSON map of data-attrid → value (born, died, spouse, etc.)
"Get all People Also Ask questions for 'best running shoes' and their answers" Array of {question, answer}
"Track AI Overview presence on 'what is machine learning' for 5 polls 30 s apart" Loop with content-length guard, returns presence + body for each poll
"What's Google showing for 'wetter' on a German IP?" Session minted with --proxy-country DE, native-language results
"Featured snippet for 'is bitcoin legal in japan' — return only the answer text" Raw answer string, no boilerplate
"Get the related-searches strip for 'machine learning'" Array of 8–10 query suggestions from the bottom-of-page strip
"Scrape Google news vertical for 'electric vehicles' last week" URL with &tbm=nws&tbs=qdr:w applied automatically
"Force the classic SERP layout for 'what is python' — no AI Overview" URL with &udm=14 applied; clean 10-organic layout

Worked example: scrape Google for "airpods" across 3 pages

You type:

"Scrape Google for 'airpods' across pages 1–3, deduped by URL, save as airpods-serp.json. Just organic results — title, url, snippet."

The agent's plan (in plain English):

  1. Mint three US-egress sessions, one per page (clean state per page beats reusing one session that drifts).
  2. Open https://www.google.com/search?q=airpods&hl=en&gl=us&start={0,10,20}, then wait 5000 (Google trackers never go fully idle, so a fixed wait beats networkidle).
  3. Confirm div[data-ved][data-hveid] count ≥ 8 — the loaded-signal that the SERP actually rendered.
  4. eval the union-selector extractor (div.VwiC3b, div[data-content-feature="1"], .lEBKkf, span.st, .MUxGbd) — Google rotates these across A/B variants, querying all of them survives the rotation.
  5. Filter to title && url && snippet, dedupe by URL, write the file.

What you get back (airpods-serp.json, abbreviated):

json
[
  { "page": 1, "title": "AirPods", "url": "https://www.apple.com/airpods/",
    "snippet": "AirPods deliver an unparalleled wireless headphone experience..." },
  { "page": 1, "title": "Apple AirPods Wireless Ear Buds, Bluetooth Headphones ...",
    "url": "https://www.amazon.com/Apple-AirPods-Charging-Latest-Model/dp/B07PXGQC1Q",
    "snippet": "The new AirPods combine intelligent design with breakthrough technology..." },
  { "page": 1, "title": "AirPods", "url": "https://en.wikipedia.org/wiki/AirPods",
    "snippet": "AirPods are wireless Bluetooth earbuds designed by Apple..." },
  { "page": 2, "title": "Best AirPods for 2026: Expert Tested and Reviewed",
    "url": "https://www.cnet.com/tech/mobile/best-apple-airpods/",
    "snippet": "Pros - Lightweight, more compact design and comfortable fit..." },
  { "page": 3, "title": "What to Expect From the Next AirPods Pro, Launching as ...",
    "url": "https://www.macrumors.com/2026/04/22/airpods-pro-cameras-2026/",
    "snippet": "The infrared cameras could recognize hand gestures..." }
  // 15 more rows, 20 total deduped
]

That's the entire user-facing surface. The selector unions, per-page session minting, and dedupe logic in Steps 3–5 below are what the skill makes the agent run — you don't have to type any of them.
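
The dedupe-by-URL step in that plan is one awk idiom. A sketch, assuming one URL per line per page file (the filenames are illustrative):

```shell
# Merge per-page URL lists, keeping the first occurrence of each URL.
# awk's seen[] map drops later duplicates without re-sorting the input,
# so page-1 ordering wins over page-2 for any URL that appears on both.
dedupe_urls() {
  awk '!seen[$0]++' "$@"
}

printf 'https://a.example/\nhttps://b.example/\n' > page-0.txt
printf 'https://b.example/\nhttps://c.example/\n' > page-10.txt
dedupe_urls page-0.txt page-10.txt   # a, b, c — b.example kept once
```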

Shaping prompts: how to control what comes back

Small phrasings change what the agent extracts and how it returns it.

Phrasing Effect
"…return JSON" / "…as CSV" Output format
"…fields: title, url only" Restricts the fields the agent extracts
"…across pages 1–5" / "…top 50" Pagination depth — fresh session per page
"…save to <name>.json" Writes to file
"…organic only" / "…drop ads/PAA/shopping" Filter — agent applies the strict title && url && snippet subset
"…on a German IP" / "…from Sydney" Sets --proxy-country / --proxy-city
"…then for each result also fetch the page title" Chains a second pass per result
"…poll every 5 minutes for 1 hour" Loop — fresh session per iteration

That's the workflow. Steps 1–7 below are not a copy-paste recipe — they're the under-the-hood reference for understanding why the agent picks DE for Google, why wait 5000 beats networkidle, and so on. Read them once to internalize the patterns; then trust your agent to apply them. Scripting outside an agent is possible (the bash works as shown) but is not the recommended path — the skill is the product.


Step 1 — Open a session with the right geography

Search engines personalise by IP location. Set the proxy geography before the browser opens any page — it cannot change mid-session.

bash
SESSION=$(scrapeless-scraping-browser new-session \
  --name "search-de" \
  --ttl 1800 \
  --proxy-country DE \
  --json | jq -r '.data.taskId')

echo "Session: $SESSION"

Portable fallback without jq:

bash
SESSION=$(scrapeless-scraping-browser new-session \
  --name "search-de" --ttl 1800 --proxy-country DE --json \
  | grep -oE '"taskId":"[^"]*"' | cut -d'"' -f4)

Why DE for Google specifically? Google's /sorry/index rate-limit fires intermittently on US residential proxy bursts (verified empirically across multiple test runs in 2026-04). DE residential egress hit zero /sorry redirects on the same query set during the verification round and remained the reliable default for every Google feature in Steps 3–6. Use the URL's hl= and gl= parameters to control which locale Google personalises for (e.g. hl=en&gl=us for US-locale English results) independently of the proxy geography.

Residential-proxy allocations occasionally return a transient 503 on the first attempt — retry once. Allocation latency is usually a few seconds for US/DE; less-populated geos may take longer. Treat the first session of a run as a probe and retry if it doesn't return a taskId.
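
That retry-once advice can be wrapped generically. A sketch: retry_nonempty is an illustrative helper, not a CLI feature.

```shell
# Re-run a command until it prints non-empty output, up to $1 attempts,
# sleeping $2 seconds between tries. Absorbs the transient 503 on the
# first residential-proxy allocation of a run.
retry_nonempty() {
  local attempts=$1 delay=$2 out i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if out=$("$@") && [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

Usage, wrapping the session mint from above: `SESSION=$(retry_nonempty 2 3 scrapeless-scraping-browser new-session --name "search-de" --ttl 1800 --proxy-country DE --json | jq -r '.data.taskId')`.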


Step 2 — Pick the right wait strategy for Google

wait --load networkidle is the obvious default but doesn't settle reliably on Google: the SERP emits continuous ad/tracker XHRs, networkidle requires a 500 ms quiet window before firing, and Google's quiet windows are rare. The recommended pattern is a fixed wait 5000 plus a container-count readiness check.

Strategy Behavior on Google Recommendation
wait --load networkidle Returns in ~14 s, adds dead time Avoid for Google
wait 5000 (fixed 5 s) Deterministic — page rendered by then Default
eval 'document.querySelectorAll("div[data-ved][data-hveid]").length' ≥ 8 True readiness signal — SERP rendered Use as a gate before extraction
bash
scrapeless-scraping-browser --session-id $SESSION open \
  "https://www.google.com/search?q=scrapeless&hl=en&gl=us"
scrapeless-scraping-browser --session-id $SESSION wait 5000

If you prefer to prove the page actually finished rendering rather than trust a timer, the signal is the organic-container count, not body bytes. A fully-rendered Google SERP for the niche query scrapeless was 4,868 bytes of text but still had 138 organic containers in the verification run — body length is not a reliable loaded-signal for specialist terms.

bash
# Loaded-signal: organic containers ≥ 8
scrapeless-scraping-browser --session-id $SESSION eval \
  'document.querySelectorAll("div[data-ved][data-hveid]").length'
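
The count check can be promoted into a generic polling gate. A sketch under stated assumptions: wait_for_count is an illustrative wrapper, and the eval call plugs in as the command.

```shell
# Poll a command that prints a number until it reaches $1, trying $2 times
# with $3-second sleeps. Returns non-zero if the threshold is never hit.
wait_for_count() {
  local threshold=$1 tries=$2 delay=$3 n i
  shift 3
  for ((i = 0; i < tries; i++)); do
    # eval output is JSON-quoted; keep digits only before comparing.
    n=$("$@" | tr -dc '0-9')
    if [ -n "$n" ] && [ "$n" -ge "$threshold" ]; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

In the pipeline: `wait_for_count 8 5 2 scrapeless-scraping-browser --session-id "$SESSION" eval 'document.querySelectorAll("div[data-ved][data-hveid]").length'` gates extraction on the ≥ 8 readiness signal instead of a blind timer.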

Step 3 — Extract Google organic results with union selectors

Google returns 10 organic results per SERP. The naive selector — "container must have h3 AND anchor AND snippet" — yields only 6 because 4 of the 10 are non-snippet cards (video previews, shopping ads, Twitter blocks). The resilient pattern is to query each field independently and zip by container.

bash
scrapeless-scraping-browser --session-id $SESSION eval '
(function(){
  const out = [];
  document.querySelectorAll("div[data-ved][data-hveid]").forEach((r, i) => {
    const h3 = r.querySelector("h3");
    const a  = r.querySelector("a[href^=\"http\"]");
    const sn = r.querySelector(
      "div.VwiC3b, div[data-content-feature=\"1\"], .lEBKkf, span.st, .MUxGbd"
    );
    const cite = r.querySelector("cite");

    // Skip containers that have NO title and NO anchor (decorative boxes)
    if (!h3 && !a) return;

    out.push({
      position:     i + 1,
      title:        h3?.textContent?.trim() || null,
      url:          a?.href || null,
      displayedUrl: cite?.textContent?.trim() || null,
      snippet:      sn?.textContent?.trim()?.slice(0, 300) || null,
    });
  });
  return JSON.stringify(out);
})()
' > google-organic.json

jq '. | length' google-organic.json                              # raw pass — expect many
jq 'map(select(.title and .url)) | length' google-organic.json   # organic + feature subset
jq 'map(select(.title and .url and .snippet))' google-organic.json  # canonical organic only

Honest observations from the verification run (query scrapeless, 2026-04-24 Ubuntu):

  • The raw pass yielded 80 containers matching the div[data-ved][data-hveid] combo after the "has title or url" filter. Google returns many card/feature containers that share this attribute pair.
  • Applying title != null && url != null narrows to 16 items — the useful working set that includes organic results + PAA entries + Knowledge Panel links.
  • Applying title && url && snippet narrows further to ~10 — the canonical organic subset. This is the strict 10-per-SERP list.
  • Pick the filter appropriate to your use case: rank-trackers usually want the canonical 10; context-mining pipelines usually want the 16-item working set.
  • cite count was 14 on the same page — expect cite to match or exceed the canonical organic count since some features (PAA entries, ads) also render a cite.

Step 4 — Extract SERP features (Featured Snippet, PAA, Knowledge Panel, Related)

Google layers several feature types on top of the organic list. Each has its own extraction pattern.

Google has been progressively moving answer text out of the classic .kno-rdesc container into per-attribute Knowledge Panel fragments (span.T286Pc for measurements, for example). The resilient pattern is a selector cascade plus a body-text regex fallback.

bash
scrapeless-scraping-browser --session-id $SESSION open \
  "https://www.google.com/search?q=how+tall+is+the+eiffel+tower&hl=en&gl=us"
scrapeless-scraping-browser --session-id $SESSION wait 5000

scrapeless-scraping-browser --session-id $SESSION eval '
(function(){
  // Pass 1 — selector cascade (classic → modern)
  const selectors = [
    ".kno-rdesc",
    "[data-attrid=\"wa:/description\"]",
    ".IZ6rdc",
    ".hgKElc",
    "span.T286Pc",               // 2026 — per-attribute fragments
    "[data-attrid] .LrzXr",      // Knowledge Panel factoids
  ];
  for (const sel of selectors) {
    const el = document.querySelector(sel);
    if (el && el.textContent.trim().length > 10) {
      return JSON.stringify({ source: "selector", selector: sel, text: el.textContent.trim() });
    }
  }

  // Pass 2 — body-text regex fallback for numeric answers
  const body = document.body.innerText;
  const m = body.match(/([0-9][0-9,\. ]*(meters|feet|metres|m|ft|km|miles|°F|°C|%)[^\.]{0,40})/i);
  if (m) return JSON.stringify({ source: "regex", text: m[0].trim() });

  return JSON.stringify({ source: null, text: null });
})()
'

4b — People Also Ask

bash
scrapeless-scraping-browser --session-id $SESSION open \
  "https://www.google.com/search?q=best+running+shoes&hl=en&gl=us"
scrapeless-scraping-browser --session-id $SESSION wait 5000

scrapeless-scraping-browser --session-id $SESSION eval '
(function(){
  const out = [];
  document.querySelectorAll(".related-question-pair, div[jsname=\"N760b\"]").forEach(q => {
    const text = q.textContent.trim();
    if (text.length > 5) out.push({ question: text.slice(0, 200) });
  });
  return JSON.stringify(out);
})()
'

The verification run returned 5 questions on the "best running shoes" query. For each question, click to expand and re-snapshot to extract the answer body.

4c — Knowledge Panel via data-attrid map

Entity queries (people, places, companies, landmarks) render a Knowledge Panel — a structured attribute map that is one of the most stable surfaces on Google in 2026.

bash
scrapeless-scraping-browser --session-id $SESSION open \
  "https://www.google.com/search?q=Albert+Einstein&hl=en&gl=us"
scrapeless-scraping-browser --session-id $SESSION wait 5000

scrapeless-scraping-browser --session-id $SESSION eval '
(function(){
  const attrs = {};
  document.querySelectorAll("div[data-attrid]").forEach(el => {
    const key = el.getAttribute("data-attrid");
    const val = el.textContent.trim().replace(/\s+/g, " ").slice(0, 200);
    if (key && val && !attrs[key]) attrs[key] = val;
  });
  return JSON.stringify({
    title:       document.querySelector("[data-attrid=\"title\"]")?.textContent?.trim() || null,
    attrCount:   Object.keys(attrs).length,
    attrs:       attrs,
  });
})()
'

20 attributes on "Albert Einstein" in the Ubuntu verification run; later runs against GB/US proxies reported 15–18 — proxy-side localization affects which factoids render, not whether the selector works. Keys follow a schema like kc:/people/person:born, kc:/people/person:died, kc:/people/person:spouse.
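
Downstream consumers usually want the last path segment of each key ("born", not "kc:/people/person:born"). A post-processing sketch, assuming a tab-separated key/value dump (the input format and the flatten_attrids helper are illustrative):

```shell
# Flatten Knowledge Panel keys like "kc:/people/person:born" to "born".
# Assumes "attrid<TAB>value" rows dumped from the eval above.
flatten_attrids() {
  # sub() is greedy, so everything up to the LAST colon is stripped.
  awk -F'\t' '{ k = $1; sub(/^.*:/, "", k); printf "%s\t%s\n", k, $2 }' "$1"
}

printf 'kc:/people/person:born\tMarch 14, 1879\nkc:/people/person:spouse\tElsa Einstein\n' > kp.tsv
flatten_attrids kp.tsv   # rows keyed "born" and "spouse"
```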

4d — Related Searches

bash
scrapeless-scraping-browser --session-id $SESSION eval '
(function(){
  const out = [];
  document.querySelectorAll("#bres a, .brs_col a, .AJLUJb, [data-reltq]").forEach(a => {
    const t = a.textContent.trim();
    if (t.length > 1 && t.length < 80) out.push(t);
  });
  return JSON.stringify(out.slice(0, 10));
})()
'

10 related queries returned on "Albert Einstein" — the bottom-of-page strip is a reliable query-expansion source for topic mining.


Step 5 — Pagination, localization, and the classic SERP

5a — Pagination via start=N

Google paginates with &start=0, &start=10, &start=20 … up to &start=90 (10 pages of practical depth; the &num= parameter was disabled in September 2025, and every page now returns exactly 10 organic results).

Mint a fresh session per page — Google's scroll-history machinery degrades result quality within a single session across multiple pagination hops.

bash
for START in 0 10 20 30 40 50; do
  SID=$(scrapeless-scraping-browser new-session \
    --name "gs-page-$START" --ttl 300 --proxy-country DE --json \
    | jq -r '.data.taskId')

  scrapeless-scraping-browser --session-id $SID open \
    "https://www.google.com/search?q=scrapeless&hl=en&gl=us&start=$START"
  scrapeless-scraping-browser --session-id $SID wait 5000

  scrapeless-scraping-browser --session-id $SID eval '
    JSON.stringify(Array.from(
      document.querySelectorAll("div[data-ved][data-hveid] a[href^=\"http\"]")
    ).slice(0, 10).map(a => a.href))
  ' > "google-page-$START.json"

  scrapeless-scraping-browser stop $SID >/dev/null 2>&1
  sleep 2
done

In the verification run, page 2 (start=10) returned 10 URLs with zero overlap to page 1 — clean pagination.

5b — Localization (hl, gl)

bash
# German user in Germany
DE_SID=$(scrapeless-scraping-browser new-session \
  --name "gs-de" --ttl 600 --proxy-country DE --json | jq -r '.data.taskId')

scrapeless-scraping-browser --session-id $DE_SID open \
  "https://www.google.com/search?q=wetter&hl=de&gl=de"
scrapeless-scraping-browser --session-id $DE_SID wait 5000

Proxy country (DE) and URL parameters (hl=de&gl=de) should match. Mismatched combinations can trigger the consent wall on .eu traffic (consent.google.com/*) — click the Reject/Accept button via eval to dismiss before extraction.

5c — Classic layout via udm=14 (AI Overview suppression)

Appending &udm=14 forces Google's "classic" SERP — no AI Overview, just organic. Useful for reproducibility and for pipelines that need a stable 10-organic layout.

bash
scrapeless-scraping-browser --session-id $SESSION open \
  "https://www.google.com/search?q=what+is+machine+learning&hl=en&gl=us&udm=14"
scrapeless-scraping-browser --session-id $SESSION wait 5000

Verified: udm=14 returned 40 organic containers and zero AI Overview on the test query — a clean classic layout.

Other useful parameters:

Parameter Effect
&tbm=nws News vertical
&tbm=shop Shopping vertical
&udm=2 Images vertical (replaced tbm=isch in 2026)
&tbs=qdr:w Past week only
&tbs=qdr:d Past 24 hours only
&udm=14 Classic SERP (no AI Overview)
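
These parameters compose. A small URL-builder sketch: serp_url is an illustrative helper, and it only plus-encodes spaces, so production code should fully URL-encode the query.

```shell
# Compose a Google SERP URL from a query plus extra parameters.
serp_url() {
  local q=$1
  shift
  q=${q// /+}   # plus-encode spaces only (illustrative shortcut)
  local url="https://www.google.com/search?q=${q}&hl=en&gl=us"
  local p
  for p in "$@"; do
    url="${url}&${p}"
  done
  printf '%s\n' "$url"
}

serp_url "electric vehicles" "tbm=nws" "tbs=qdr:w"
# https://www.google.com/search?q=electric+vehicles&hl=en&gl=us&tbm=nws&tbs=qdr:w
```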

Step 6 — AI Overview (SGE) extraction — non-deterministic

AI Overview renders for some queries but not others, and the same query may or may not return one on repeat — render rate varies by topic, locale, account state, and session. Design the pipeline to accept both outcomes (present: true and present: false) as normal.

bash
AI_SID=$(scrapeless-scraping-browser new-session \
  --name "gs-ai" --ttl 600 --proxy-country DE --json | jq -r '.data.taskId')

scrapeless-scraping-browser --session-id $AI_SID open \
  "https://www.google.com/search?q=what+is+machine+learning&hl=en&gl=us"
scrapeless-scraping-browser --session-id $AI_SID wait 5000

# Poll up to 10 s — AI Overview renders asynchronously.
# IMPORTANT: the container selectors match a placeholder element on queries
# where AI Overview was NOT served. Always guard on textContent length > 100
# before declaring "present=true" — otherwise you get false positives with
# text_len=0, cites=0.
for i in 1 2 3 4 5; do
  PRESENT=$(scrapeless-scraping-browser --session-id $AI_SID eval '
    (function(){
      const ai = document.querySelector(
        "[data-subtree=\"gw\"], .yp, .LT6XE, [aria-label*=\"AI Overview\"], [jsname=\"uIYcDb\"]"
      );
      if (!ai || ai.textContent.trim().length < 100) return "no";
      return "yes";
    })()
  ' | tail -1 | tr -d '"')
  [ "$PRESENT" = "yes" ] && break
  sleep 2
done

scrapeless-scraping-browser --session-id $AI_SID eval '
(function(){
  const ai = document.querySelector(
    "[data-subtree=\"gw\"], .yp, .LT6XE, [aria-label*=\"AI Overview\"], [jsname=\"uIYcDb\"]"
  );
  // Content-length guard: the container selectors false-positive on a
  // placeholder element when AI Overview is not actually rendered.
  if (!ai || ai.textContent.trim().length < 100) {
    // Secondary body-text heuristic before declaring absence
    const bodyHit = /AI Overview|Generated with AI/i.test(document.body.innerText);
    return JSON.stringify({ present: false, body_heuristic: bodyHit });
  }
  const citations = Array.from(ai.querySelectorAll("a[href^=\"http\"]"))
    .slice(0, 8)
    .map(a => ({ url: a.href, text: a.textContent.trim().slice(0, 80) }));
  return JSON.stringify({
    present: true,
    text:    ai.textContent.trim().slice(0, 800),
    citations,
  });
})()
'

In verification runs, the naive querySelector approach false-positived on what is machine learning — the placeholder element exists whether or not AI Overview actually rendered, returning present: true, text_len: 0, cites: 0. The content-length guard above (textContent.length < 100 ⇒ treat as absent) is mandatory, not optional.

Production pipelines should:

  1. Accept present: false as normal — not a failure.
  2. Optionally re-poll on a fresh session after 60+ seconds — AI Overview presence varies across short time windows.
  3. Use &udm=14 to force classic SERP when AI Overview is actively unwanted (Step 5c).
  4. Log body_heuristic: true cases — these are useful signal when the container selectors miss but the body text confirms AI Overview was rendered. Those warrant a selector-discovery pass in Step 2 style to catch the new container.
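
For the logging step, per-poll yes/no outcomes reduce to a presence rate. A sketch assuming one outcome per line (the presence_rate helper is illustrative):

```shell
# Reduce per-poll "yes"/"no" lines to an AI Overview presence rate.
presence_rate() {
  awk '{ n++; if ($1 == "yes") y++ }
       END { printf "%d/%d polls (%.0f%%)\n", y, n, (n ? 100 * y / n : 0) }'
}

printf 'yes\nno\nyes\nyes\nno\n' | presence_rate
# 3/5 polls (60%)
```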

Step 7 — Scaling: isolate per-worker CLI state

An important gotcha caught during verification: the Scrapeless CLI does not isolate daemon state across parallel shells on the same host. Running three shells that each mint their own session via new-session and then navigate concurrently collapses to a single winner — the other two shells end up querying the same underlying browser context, and two of the three eval calls return 0 organic nodes.

This is a CLI-level local-state issue, not a session-ID issue. Passing --session-id correctly on every call is not sufficient; the CLI's shared daemon, PID/port files, and session-pool cache on the local host override --session-id under parallel load.

The actually-working primitives (verified across 10+ parallel CLI agents on 2026-04-26):

  1. Single-shell && chaining — chain every CLI call for one job in a single atomic shell invocation; other workers cannot interleave between your steps. This is the load-bearing primitive.
  2. Unique session names per worker — the daemon port is hashed from the name; unique names dodge port collisions.
  3. Cap at ~3 concurrent workers per host — empirically beyond that, transient chrome://new-tab-page/, ERR_TUNNEL_CONNECTION_FAILED, and "session terminated" errors compound.
  4. USERPROFILE/HOME env vars are documented in the upstream skill but do NOT isolate the Rust v0.1.1 binary on Windows in verification. Don't depend on them. For more fan-out, shard across hosts.

Fallback patterns:

  • Shard across hosts. Once you exceed ~3 in-flight workers per host, move additional workers to separate machines (each host gets its own daemon). This still works because daemon state is per-host, not per-account.
  • Sequential per host. Run one SERP fetch at a time per host; queue the rest. Simple, slower, and plenty for small pipelines.

Without single-shell chaining, keep at most one request in flight per host. Don't push past that unless every worker's full call sequence (new-session && open && wait && eval) lives in one atomic shell command.
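
The two load-bearing primitives (one atomic shell per job, at most three in flight per host) map directly onto xargs -P. A structural sketch, with echo standing in for the chained CLI calls:

```shell
# Fan out six page fetches, max 3 concurrent, one atomic `bash -c` per job.
# echo stands in for the real chain:
#   new-session && open "...start={}" && wait 5000 && eval ... && stop
printf '%s\n' 0 10 20 30 40 50 \
  | xargs -P 3 -I{} bash -c 'echo "worker gs-$$-{}: fetched start={}"'
```

The inner `gs-$$-{}` naming gives every worker a unique session name (PID plus page offset), which is what dodges the daemon port collisions described above.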


What You Get Back

The canonical schema for a Google SERP poll looks like this. The organic[0] values below come from the live response captured for the query scrapeless on a DE-egress session (2026-04-27 verification):

```json
{
  "query": "scrapeless",
  "timestamp": "2026-04-27T15:42:00Z",
  "locale": { "hl": "en", "gl": "us" },
  "organic": [
    {
      "position": 1,
      "title": "Scrapeless: Effortless Web Scraping Toolkit",
      "url": "https://www.scrapeless.com/",
      "displayedUrl": "https://www.scrapeless.com",
      "snippet": "Scrapeless offers AI-powered, robust, and scalable web scraping and automation services trusted by leading enterprises. Our enterprise-grade solutions are…"
    }
  ],
  "organicCount":   10,
  "citeCount":      14,
  "featuredSnippet": null,
  "peopleAlsoAsk":   [],
  "knowledgePanel":  null,
  "relatedSearches": [],
  "aiOverview":     { "present": false },
  "errors":         []
}
```
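
The organic[] array in this schema can be assembled from per-card fields gathered by independent querySelector passes. A minimal sketch, assuming an illustrative card shape (the field names here are not from the CLI):

```javascript
// Sketch: build organic[] from per-card fields. A card counts as organic
// only if it has both an h3 title and an anchor href; the snippet stays
// optional because Google rotates the snippet container across A/B variants.
function toOrganic(cards) {
  return cards
    .filter(c => c.title && c.url) // strict h3 + anchor rule
    .map((c, i) => ({
      position: i + 1,             // renumber after ads/video cards drop out
      title: c.title,
      url: c.url,
      displayedUrl: c.displayedUrl || null,
      snippet: c.snippet || null,  // missing on some layout variants
    }));
}
```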

Honest observations:

  • organicCount will frequently be 10 but the total "card" count on the page can be 100+ — filter strictly on h3 + anchor presence.
  • featuredSnippet.source in production pipelines should be one of "selector-classic", "selector-attrid", "selector-T286Pc", or "regex-body-text", so downstream consumers know how much to trust each field.
  • aiOverview.present: false is a valid state, not an error — AI Overview is non-deterministic per query.
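
The featuredSnippet.source tagging can be implemented as a plain cascade over whatever each selector pass returned, with the body-text regex as the last resort. A sketch; the candidate field names and the height-style regex are illustrative assumptions:

```javascript
// Sketch: try selector results in priority order, tag the winner with its
// source so downstream consumers know how much to trust the field.
function resolveFeaturedSnippet(candidates, bodyText) {
  const cascade = [
    ['selector-classic', candidates.classic], // e.g. the old .kno-rdesc text
    ['selector-attrid',  candidates.attrid],  // per-attribute panel fragment
    ['selector-T286Pc',  candidates.t286pc],  // span.T286Pc variant
  ];
  for (const [source, text] of cascade) {
    if (text && text.trim()) return { text: text.trim(), source };
  }
  // Fallback: loose pattern over visible body text (assumes a height-style
  // answer such as "6 feet 2 inches").
  const m = bodyText && bodyText.match(/\d+\s*(?:feet|ft)\s*\d*\s*(?:inches|in)?/i);
  return m ? { text: m[0], source: 'regex-body-text' } : null;
}
```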

Need structured JSON without DOM work? Use the Scraper API

The Scraping Browser approach above gives you full flexibility — you control the selectors, the wait strategy, and the exact JSON shape. If you'd rather skip the DOM entirely and receive structured Google SERP JSON directly, Scrapeless ships a dedicated Google Search Scraper API:

| Surface | Scraping Browser (this post) | Google Search Scraper API |
| --- | --- | --- |
| Control over selectors | Full — you write every querySelector | None — the API returns a fixed JSON schema |
| DOM exploration cost | You read live HTML first | None — you send {q, hl, gl} and get JSON back |
| Latency per request | 2 s mint + 5–10 s render | Single HTTP roundtrip |
| Best for | Custom fields, SERP features, AI Overview, non-standard layouts | Structured rank-tracking, high-QPS pipelines |
| Cost model | Billed per session-minute | Billed per API call |
| Concurrency | Single-shell && chaining + unique session names; cap ~3 workers/host (Step 7) | API-side — no browser state to manage |

For pipelines that monitor a small keyword set on a steady schedule, the API is typically cheaper and simpler. For custom selector work, AI Overview harvesting, and non-standard layouts, the Scraping Browser is the flexible path.

Conclusion

Scraping Google Search in 2026 is no longer about collecting static HTML. It requires a workflow that can handle rotating SERP layouts, fixed wait logic, locale-aware sessions, and feature-level extraction across organic results, People Also Ask, Knowledge Panels, Featured Snippets, Related Searches, and AI Overview.

Scrapeless Scraping Browser gives teams a practical way to do that in production. With residential proxies, anti-detection fingerprinting, JavaScript rendering, and session-level geo control, it reduces the maintenance burden of scraping Google while keeping the pipeline flexible enough for custom selectors and non-standard layouts. For teams that want a simpler structured-data path, the Scrapeless Google Search Scraper API is the faster option.

Ready to Scrape Now?

Join our vibrant community to claim a $5-10 free plan and connect with fellow innovators:

Scrapeless Official Discord Community
Scrapeless Official Telegram Community


FAQ

Q: Can I avoid proxies?
A: Not reliably. Datacenter IP ranges are filtered aggressively by Google's edge, and request patterns from a single IP attract throttling quickly. Residential proxies (--proxy-country DE for Google, see Step 1) are the load-bearing primitive for sustained scraping.

Q: Why does wait --load networkidle take ~14 seconds on Google?
A: Google emits continuous ad/tracker XHRs — networkidle requires a 500 ms quiet window before firing, and Google's quiet windows are rare. Use the fixed wait 5000 from Step 2 plus a div[data-ved][data-hveid] count check rather than trusting networkidle.
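
That fixed-wait-plus-count check can be wrapped as a small poll. A sketch, assuming countContainers is a helper you supply that runs document.querySelectorAll('div[data-ved][data-hveid]').length through the CLI's eval verb:

```javascript
// Sketch: poll the SERP container count instead of trusting networkidle.
// countContainers is an assumed async helper returning the current count.
async function waitForSerp(countContainers, { tries = 4, delayMs = 2500 } = {}) {
  for (let i = 0; i < tries; i++) {
    if ((await countContainers()) >= 8) return true; // SERP rendered
    await new Promise(r => setTimeout(r, delayMs));  // fixed wait, then recheck
  }
  return false; // likely /sorry/index or an empty shell — re-mint the session
}
```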

Q: Why use Scrapeless Scraping Browser for Google scraping instead of a local browser setup?
A: Because Google Search is highly dynamic and anti-bot heavy. Scrapeless handles proxy routing, fingerprinting, and rendering in the cloud, which makes the scraping workflow much more stable than maintaining local Playwright or Puppeteer jobs.

Q: Can Scrapeless extract AI Overview, Featured Snippets, and People Also Ask?
A: Yes. The browser workflow is designed for feature-level extraction, so you can pull organic results, Featured Snippets, PAA, Knowledge Panels, Related Searches, and AI Overview from the same session pattern.

Q: When should I use the Google Search Scraper API instead?
A: Use the API when you want structured JSON with less selector work and lower operational overhead. Use Scraping Browser when you need custom extraction, layout resilience, or direct control over how each SERP feature is read.

Q: Does Scrapeless support multi-locale Google scraping?
A: Yes. You can align proxy geography, language, and timezone per session, which helps keep results consistent across different markets and local SERP variations.

At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.
