Web Scrape
Scrapes content from a URL and returns it in various formats including markdown, HTML, or screenshots.
POST /v1/tools/web_scrape
Authentication
- Authorization (Bearer, required): API key sent as a bearer token in the Authorization header. Create keys at Manage API Keys.
Headers
- Content-Type (string, required): Must be application/json.
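A minimal sketch of assembling the authenticated request with Python's standard library. Only the method, path, and the two headers come from this reference; the host `https://api.example.com` and the environment variable name `SCRAPE_API_KEY` are placeholders, since the source states neither. The request is built but not sent.

```python
import json
import os
import urllib.request

# Hypothetical values: the source does not give the API host or say how the key is stored.
BASE_URL = "https://api.example.com"
API_KEY = os.environ.get("SCRAPE_API_KEY", "YOUR_API_KEY")


def build_request(body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to /v1/tools/web_scrape."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/tools/web_scrape",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request({"url": "https://example.com"})
# To send it, pass the object to urllib.request.urlopen and JSON-decode the response body.
```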
Body
- url (string, required): The URL to scrape.
- formats ((string | object)[], optional): Output formats to return. Defaults to markdown. Each entry is either a format name string (e.g. "markdown") or an object { type: <name>, ...options } for formats that take options. See Format options below. Allowed names: markdown, summary, html, rawHtml, links, images, screenshot, json, changeTracking, branding, audio, video, question, highlights.
- onlyMainContent (boolean, optional, default false): Extract only the main content, excluding headers, footers, etc.
- onlyCleanContent (boolean, optional, default false): Return cleaned content without ads and other noise.
- includeTags (string[], optional): HTML tags to include in extraction.
- excludeTags (string[], optional): HTML tags to exclude from extraction.
- maxAge (integer, optional): Maximum age of cached content in seconds. Minimum: 1.
- minAge (integer, optional): Minimum age of cached content in seconds. Minimum: 1.
- headers (object, optional): Custom headers to send with the request, as key-value pairs.
- waitFor (integer, optional): Time to wait for the page to load, in milliseconds. Minimum: 1.
- mobile (boolean, optional, default false): Emulate a mobile device.
- skipTlsVerification (boolean, optional, default false): Skip TLS certificate verification.
- timeout (integer, optional): Request timeout in milliseconds. Minimum: 1.
- parsers (string[], optional): Parsers to use for content extraction.
- actions (object[], optional): Browser actions to perform before scraping (click, scroll, etc.).
- location (object, optional): Geographic location settings with country (string) and languages (string[]).
- removeBase64Images (boolean, optional, default false): Remove base64-encoded images from output.
- blockAds (boolean, optional, default false): Block ads during page load.
- proxy (string, optional): Proxy mode to use. Allowed values: basic, stealth.
- storeInCache (boolean, optional, default true): Store the result in cache.
- lockdown (boolean, optional, default false): Enable lockdown mode.
- zeroDataRetention (boolean, optional, default false): Do not retain any data after the request.
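The constraints in the parameter list (integer fields with a minimum of 1, and the two proxy modes) can be checked on the client before a request is sent. The sketch below is one way to do that; `build_scrape_body` is a hypothetical helper, not part of the API, and the `"US"` country value is an assumption about the expected format.

```python
# Constraints taken from the parameter list above; the helper itself is hypothetical.
MIN_ONE_FIELDS = ("maxAge", "minAge", "waitFor", "timeout")
PROXY_MODES = ("basic", "stealth")


def build_scrape_body(url: str, **options) -> dict:
    """Return a request body, rejecting values the reference rules out."""
    if not url:
        raise ValueError("url is required")
    for name in MIN_ONE_FIELDS:
        value = options.get(name)
        if value is None:
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be an integer >= 1")
    proxy = options.get("proxy")
    if proxy is not None and proxy not in PROXY_MODES:
        raise ValueError(f"proxy must be one of {PROXY_MODES}")
    # Drop unset options so the documented server-side defaults apply.
    return {"url": url, **{k: v for k, v in options.items() if v is not None}}


body = build_scrape_body(
    "https://example.com/article",
    onlyMainContent=True,
    excludeTags=["nav", "footer"],
    waitFor=2000,    # milliseconds
    timeout=30000,   # milliseconds
    maxAge=3600,     # seconds
    proxy="basic",
    location={"country": "US", "languages": ["en"]},  # country format is assumed
)
```

Omitting an option leaves it out of the body entirely, so defaults such as storeInCache being true are decided by the service rather than duplicated in client code.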
Response
- id (string, optional): Unique request ID for this call.
- data (object, optional): Scraped content. Contains fields based on the requested formats: markdown, summary, html, rawHtml, links, images, screenshot, json, changeTracking, branding, audio, video, question, highlights.
- data.metadata (object, optional): Page metadata including title, description, language, and sourceURL.
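Because every response field is marked optional, client code is safer reading them defensively. The following sketch shows one approach; `summarize_response` is a hypothetical helper, the sample payload is invented for illustration, and treating every key in data other than metadata as a returned format is an assumption.

```python
def summarize_response(payload: dict) -> dict:
    """Pull out the documented fields, tolerating any that are absent."""
    data = payload.get("data") or {}
    metadata = data.get("metadata") or {}
    return {
        "request_id": payload.get("id"),
        "title": metadata.get("title"),
        "source_url": metadata.get("sourceURL"),
        # Assumes each key in data other than metadata is a returned format.
        "formats_returned": sorted(k for k in data if k != "metadata"),
    }


# Illustrative payload only; real values will differ.
sample = {
    "id": "req_123",
    "data": {
        "markdown": "# Example Domain",
        "metadata": {"title": "Example Domain", "sourceURL": "https://example.com"},
    },
}
summary = summarize_response(sample)
```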
Format options
Formats that accept options must be passed as objects. The type field
selects the format; other fields configure it.
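The string-or-object rule can be captured in a small helper that emits a bare name when no options are given and a typed object otherwise. This is a sketch only: `format_entry` is a hypothetical function, and the set of names is copied from the formats parameter.

```python
FORMAT_NAMES = {
    "markdown", "summary", "html", "rawHtml", "links", "images", "screenshot",
    "json", "changeTracking", "branding", "audio", "video", "question", "highlights",
}


def format_entry(name: str, **options):
    """Return a bare name when there are no options, else a {type, ...options} object."""
    if name not in FORMAT_NAMES:
        raise ValueError(f"unknown format: {name}")
    return {"type": name, **options} if options else name


formats = [
    format_entry("markdown"),
    format_entry("question", question="Who wrote this page?"),
]
```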
json — structured extraction
{
  "type": "json",
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "points": { "type": "number" }
    }
  }
}
Result appears at data.json.
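A sketch of requesting structured extraction and then sanity-checking the result. It assumes that data.json follows the shape of the supplied schema, which the reference implies but does not state. `mismatched_fields` is a hypothetical, deliberately shallow check of top-level property types, not a full JSON Schema validator, and the extracted value is invented.

```python
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "points": {"type": "number"},
    },
}
body = {"url": "https://example.com/post", "formats": [{"type": "json", "schema": schema}]}

# Python types accepted for each JSON Schema primitive handled by this sketch.
PY_TYPES = {"string": str, "number": (int, float), "boolean": bool, "object": dict, "array": list}


def mismatched_fields(result: dict, schema: dict) -> list:
    """List top-level properties that are present but have the wrong type."""
    bad = []
    for name, spec in schema.get("properties", {}).items():
        expected = PY_TYPES.get(spec.get("type"))
        if name not in result or expected is None:
            continue
        value = result[name]
        # bool is a subclass of int in Python, so it must not pass as a number.
        if isinstance(value, bool) and spec.get("type") != "boolean":
            bad.append(name)
        elif not isinstance(value, expected):
            bad.append(name)
    return bad


# Illustrative value for data.json; not real API output.
extracted = {"title": "Show HN: Example", "points": "128"}
problems = mismatched_fields(extracted, schema)  # "points" is a string, not a number
```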
question — ask a question about the page
{
  "type": "question",
  "question": "When was this article published and by whom?"
}
Result appears at data.question.
changeTracking — diff against previous scrape
{
  "type": "changeTracking",
  "modes": ["git-diff"]
}
Result appears at data.changeTracking.
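Since each format's result appears under data at a key matching its name, a request that mixes plain names and option objects can be paired with its results generically. The sketch below assumes that pattern holds for every format; `result_key` and `collect_results` are hypothetical helpers, and the data object is invented.

```python
def result_key(entry) -> str:
    """A format entry is either a name or an object whose type field holds the name."""
    return entry if isinstance(entry, str) else entry["type"]


body = {
    "url": "https://example.com/pricing",
    "formats": [
        "markdown",
        {"type": "question", "question": "When was this article published and by whom?"},
        {"type": "changeTracking", "modes": ["git-diff"]},
    ],
}


def collect_results(body: dict, data: dict) -> dict:
    """Pair each requested format with its value in data, or None if it is missing."""
    return {result_key(e): data.get(result_key(e)) for e in body["formats"]}


# Illustrative data object; the shape of each value is an assumption.
data = {"markdown": "# Pricing", "question": "Not stated on the page."}
results = collect_results(body, data)
```

A value of None for a requested format is a convenient signal that the response did not include it.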
Use Cases
- Content extraction: Pull article text from news sites
- SEO analysis: Extract metadata and link structure
- Screenshots: Capture visual snapshots of pages
- Data mining: Scrape structured data from web pages