Web Scrape

Scrapes content from a URL and returns it in one or more formats, such as markdown, HTML, or a screenshot.

POST /v1/tools/web_scrape

Authentication

Authorization (Bearer, required)

Pass your API key as a bearer token in the Authorization header. Create keys on the Manage API Keys page.

Headers

Content-Type (string, required)

Must be application/json.
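
A minimal sketch of how the authentication and header requirements above combine into a single request. The host https://api.example.com is a placeholder assumption (this reference gives only the path), and build_scrape_request is a hypothetical helper name. The request is built but not sent, and the body carries only url, the one required body field.

```python
import json
import urllib.request

# Assumption: placeholder host; the reference documents only the path.
BASE_URL = "https://api.example.com"
ENDPOINT = "/v1/tools/web_scrape"


def build_scrape_request(api_key: str, url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the web_scrape endpoint."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=body,
        method="POST",
        headers={
            # API key passed as a bearer token.
            "Authorization": f"Bearer {api_key}",
            # The endpoint requires a JSON body.
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the returned object to urllib.request.urlopen and decode the JSON response.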

Body

url (string, required)

The URL to scrape.

formats ((string | object)[], optional)

Output formats to return. Defaults to markdown when omitted. Each entry is either a format name string (e.g. "markdown") or an object { type: <name>, ...options } for formats that take options. See Format options below.

Allowed values: markdown, summary, html, rawHtml, links, images, screenshot, json, changeTracking, branding, audio, video, question, highlights

onlyMainContent (boolean, optional, default: false)

Extract only the main content, excluding headers, footers, etc.

onlyCleanContent (boolean, optional, default: false)

Return cleaned content without ads and other noise.

includeTags (string[], optional)

HTML tags to include in extraction.

excludeTags (string[], optional)

HTML tags to exclude from extraction.

maxAge (integer, optional)

Maximum age of cached content in seconds. Minimum: 1.

minAge (integer, optional)

Minimum age of cached content in seconds. Minimum: 1.

headers (object, optional)

Custom headers to send with the request. Key-value pairs.

waitFor (integer, optional)

Time to wait for the page to load, in milliseconds. Minimum: 1.

mobile (boolean, optional, default: false)

Emulate a mobile device.

skipTlsVerification (boolean, optional, default: false)

Skip TLS certificate verification.

timeout (integer, optional)

Request timeout in milliseconds. Minimum: 1.

parsers (string[], optional)

Parsers to use for content extraction.

actions (object[], optional)

Browser actions to perform before scraping (click, scroll, etc.).

location (object, optional)

Geographic location settings with country (string) and languages (string[]).

removeBase64Images (boolean, optional, default: false)

Remove base64 encoded images from output.

blockAds (boolean, optional, default: false)

Block ads during page load.

proxy (string, optional)

Proxy mode to use.

Allowed values: basic, stealth

storeInCache (boolean, optional, default: true)

Store result in cache.

lockdown (boolean, optional, default: false)

Enable lockdown mode.

zeroDataRetention (boolean, optional, default: false)

Do not retain any data after request.
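
The constraints listed for the body fields (a required url, a minimum of 1 for the integer fields, and the two allowed proxy values) can be checked on the client before a request is sent. The following is a rough sketch only: validate_body is a hypothetical helper, not part of the API, and the server may enforce rules beyond those documented here.

```python
from typing import Any

# Values taken from the field descriptions above.
ALLOWED_PROXY = ("basic", "stealth")
MIN_ONE_FIELDS = ("maxAge", "minAge", "waitFor", "timeout")


def validate_body(body: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    if not isinstance(body.get("url"), str) or not body["url"]:
        problems.append("url is required and must be a non-empty string")
    for name in MIN_ONE_FIELDS:
        if name in body:
            value = body[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                problems.append(f"{name} must be an integer >= 1")
    if "proxy" in body and body["proxy"] not in ALLOWED_PROXY:
        problems.append("proxy must be 'basic' or 'stealth'")
    return problems
```

For example, validate_body({"url": "https://example.com", "timeout": 30000}) returns an empty list, while a body without url or with "timeout": 0 returns one message per violation.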

Response

id (string, optional)

Unique request ID for this call.

data (object, optional)

Scraped content. Contains fields based on requested formats: markdown, summary, html, rawHtml, links, images, screenshot, json, changeTracking, branding, audio, video, question, highlights.

data.metadata (object, optional)

Page metadata including title, description, language, and sourceURL.
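
Because every response field is marked optional, client code should read the response defensively. The sketch below assumes the response has already been decoded from JSON into a dict; summarize_response is a hypothetical helper, and its output keys are illustrative names rather than part of the API.

```python
from typing import Any


def summarize_response(response: dict[str, Any]) -> dict[str, Any]:
    """Pull out the request ID, key metadata, and which formats came back."""
    # Fall back to empty dicts so missing optional fields do not raise.
    data = response.get("data") or {}
    metadata = data.get("metadata") or {}
    return {
        "id": response.get("id"),
        "title": metadata.get("title"),
        "source_url": metadata.get("sourceURL"),
        # Every key in data other than metadata corresponds to a returned format.
        "formats_present": sorted(k for k in data if k != "metadata"),
    }
```

A response with no data field yields None for the metadata values and an empty formats_present list instead of an exception.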

Format options

Formats that accept options must be passed as objects. The type field selects the format; other fields configure it.

json — structured extraction

{
  "type": "json",
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "points": { "type": "number" }
    }
  }
}

Result appears at data.json.

question — ask a question about the page

{
  "type": "question",
  "question": "When was this article published and by whom?"
}

Result appears at data.question.

changeTracking — diff against previous scrape

{
  "type": "changeTracking",
  "modes": ["git-diff"]
}

Result appears at data.changeTracking.
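
Plain format names and option objects can be mixed in a single formats array. The sketch below builds such an array from the three examples above and derives the data field where each result is expected, on the assumption that every format's result is keyed by its format name, as the three documented cases suggest. result_keys is a hypothetical helper.

```python
from typing import Any, Union

# A formats entry is either a bare name or an object with a "type" field.
FormatEntry = Union[str, dict[str, Any]]


def result_keys(formats: list[FormatEntry]) -> list[str]:
    """Return the expected data.<key> name for each requested format."""
    keys = []
    for entry in formats:
        # Strings name the format directly; objects name it in "type".
        name = entry if isinstance(entry, str) else entry["type"]
        keys.append(name)
    return keys


formats: list[FormatEntry] = [
    "markdown",
    {
        "type": "json",
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
        },
    },
    {"type": "question", "question": "When was this article published and by whom?"},
    {"type": "changeTracking", "modes": ["git-diff"]},
]
```

Calling result_keys(formats) on this list gives the names to look up under data in the response.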

Use Cases

  • Content extraction: Pull article text from news sites
  • SEO analysis: Extract metadata and link structure
  • Screenshots: Capture visual snapshots of pages
  • Data mining: Scrape structured data from web pages