crawlio

🕷️ Scrape Endpoint

Initiate a one-time scrape of a specific webpage. This endpoint allows fine-grained control over what content is retrieved and how it's formatted.

🧰 Using with SDKs

Prefer code over curl? Crawlio offers official SDKs for seamless integration with your stack:

📖 View full usage docs: 👉 Node.js SDK Docs 👉 Python SDK Docs

We are working on more extensive documentation for our SDKs. Thanks for your patience!

Cost

| Name   | Cost | Type   |
| ------ | ---- | ------ |
| Scrape | 1    | Scrape |

Using with REST API

🧪 POST /scrape

curl -X POST https://crawlio.xyz/api/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/article",
    "exclude": ["footer", ".ads"],
    "includeOnly": [".article-body"],
    "markdown": true,
    "returnUrls": true
  }'

📥 Request

Endpoint:

POST https://crawlio.xyz/api/scrape

Headers:

Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

Request Body Parameters:

| Field | Type | Required | Description |
| ----- | ---- | -------- | ----------- |
| url | string | ✅ Yes | The full URL of the webpage to be scraped. |
| exclude | array of strings | ❌ No | CSS selectors of elements to remove from the scraped content. |
| includeOnly | array of strings | ❌ No | CSS selectors to restrict scraping to only these elements. |
| markdown | boolean | ✅ Yes | Whether to return the content as Markdown in addition to raw HTML. |
| returnUrls | boolean | ✅ Yes | Whether to return a list of all URLs found on the page. |
| workflow | array of objects | ❌ No | A sequence of browser actions to execute before scraping begins. |
| userAgent | string | ❌ No | User-Agent string to simulate different browsers or devices. |
| cookies | array of objects | ❌ No | One or more cookies to send as part of the scrape request. |

🧾 Example Request

{
  "url": "https://example.com/blog/article",
  "exclude": ["footer", ".ads"],
  "includeOnly": [".article-body"],
  "markdown": true,
  "returnUrls": true
}
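For reference, the same request can be assembled with Python's standard library alone. This is a minimal sketch with no SDK dependency, using only the endpoint and headers documented above; actually sending the request is left commented out:

```python
import json
import urllib.request

API_URL = "https://crawlio.xyz/api/scrape"

def build_scrape_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Assemble a POST /scrape request from the body fields documented above."""
    if "url" not in payload:
        raise ValueError("'url' is required")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_scrape_request("YOUR_API_KEY", {
    "url": "https://example.com/blog/article",
    "exclude": ["footer", ".ads"],
    "includeOnly": [".article-body"],
    "markdown": True,
    "returnUrls": True,
})
# urllib.request.urlopen(req) would send it; response handling is omitted here.
```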

📤 Response

On success, you'll receive a 200 OK response with the following JSON structure:

| Field | Type | Description |
| ----- | ---- | ----------- |
| jobId | string | Unique ID of the scraping job. |
| url | string | The target URL that was scraped. |
| urls | array of strings | List of URLs found on the page (if returnUrls is true). |
| html | string | The full HTML content after processing. |
| markdown | string | Markdown version of the content (if markdown is true). |
| meta | object | Metadata such as generator, viewport size, etc. |
| baseUrl | string | Base URL used for relative links. |
| screenshots | object | Captured screenshots if defined in workflow steps. |
| evaluation | object | Results or errors from custom eval scripts in workflow. |

📦 Example Response

{
  "jobId": "abc123",
  "url": "https://example.com/blog/article",
  "urls": [
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "html": "<html>...</html>",
  "markdown": "# Article Title\n\nContent goes here...",
  "meta": {
    "generator": "CrawlioBot v1.2",
    "viewport": "1280x720"
  },
  "baseUrl": "https://example.com"
}
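When consuming the response in code, note that several fields are conditional. The sketch below assumes optional fields are simply absent when the corresponding request option was off (the docs above don't spell this out), so it reads them defensively:

```python
def summarize_response(resp: dict) -> dict:
    """Extract the always-present fields and read conditional ones defensively."""
    return {
        "jobId": resp["jobId"],
        "url": resp["url"],
        "links_found": len(resp.get("urls", [])),    # only if returnUrls was true
        "has_markdown": "markdown" in resp,          # only if markdown was true
        "screenshots": resp.get("screenshots", {}),  # only if workflow captured any
    }

# The example response from above, minus meta/baseUrl for brevity.
example = {
    "jobId": "abc123",
    "url": "https://example.com/blog/article",
    "urls": ["https://example.com/about", "https://example.com/contact"],
    "html": "<html>...</html>",
    "markdown": "# Article Title\n\nContent goes here...",
}
summary = summarize_response(example)
```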

What and Why?

The Scrape feature is designed for extracting content from a single webpage in a structured, flexible way. It's perfect when you want to capture data from a specific URL without crawling additional pages.

You can fine-tune the scrape with options like:

  • Include Only: Limit output to specific sections of the page (e.g., just the article body).
  • Exclude Elements: Remove unwanted parts such as ads, navigation bars, or footers.
  • Markdown Output: Convert the content into clean, readable Markdown, great for storage, readability, or further processing.
  • Return URLs: Get a list of all internal/external links found on the page for further crawling or indexing.

This feature is ideal for tasks like:

  • Blog or news article extraction
  • Product detail scraping
  • Content archiving
  • Clean page capture for summarization or AI processing

You can initiate the scrape using the /scrape endpoint through any SDK or directly via the REST API.


⚙️ Workflow Feature (Advanced Page Interaction)

The workflow field enables a powerful way to control browser behavior before scraping. Define a sequence of interactions, such as scrolling, clicking, waiting, or evaluating JavaScript, to ensure the page is in the right state before content extraction begins.

🧠 Supported Actions

| Type | Description |
| ---- | ----------- |
| wait | Pause execution for a specified time (in milliseconds). |
| scroll | Scroll to a DOM element using a CSS selector. |
| click | Simulate a click on an element. |
| eval | Execute custom JavaScript in the page context. |
| screenshot | Capture an image of a specific element or the viewport. |

📋 Example Workflow

"workflow": [
  { "type": "wait", "duration": 1000 },
  { "type": "scroll", "selector": "footer" },
  { "type": "screenshot", "selector": "footer", "id": "footer-ss" },
  { "type": "screenshot", "selector": "viewport", "id": "viewport-ss" },
  { "type": "eval", "script": "console.log('hello world')" },
  { "type": "scroll", "selector": "main" }
]
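It can help to sanity-check a workflow client-side before sending it. This is a hedged sketch: the required keys per action type are inferred from the table and example on this page, and the server's actual validation rules may be stricter:

```python
# Required keys per step type, inferred from this page's examples.
REQUIRED_KEYS = {
    "wait": {"duration"},
    "scroll": {"selector"},
    "click": {"selector"},
    "eval": {"script"},
    "screenshot": {"selector", "id"},
}

def validate_workflow(steps):
    """Raise ValueError on the first malformed step."""
    for i, step in enumerate(steps):
        kind = step.get("type")
        if kind not in REQUIRED_KEYS:
            raise ValueError(f"step {i}: unknown type {kind!r}")
        missing = REQUIRED_KEYS[kind] - step.keys()
        if missing:
            raise ValueError(f"step {i}: missing keys {sorted(missing)}")

validate_workflow([
    {"type": "wait", "duration": 1000},
    {"type": "scroll", "selector": "footer"},
    {"type": "screenshot", "selector": "footer", "id": "footer-ss"},
])  # passes silently
```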

📸 Screenshot Output

Each screenshot is returned in the screenshots object:

"screenshots": {
  "footer-ss": "https://cdn.example.com/screens/abc123.png",
  "viewport-ss": "https://cdn.example.com/screens/def456.png"
}

🧪 Eval Result

Custom JavaScript evaluation results are keyed by the step's zero-based position in the workflow array (the eval step in the example above sits at index 4):

"evaluation": {
  "4": {
    "result": "",
    "error": ""
  }
}
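A small helper can pair each evaluation entry back with the step that produced it. One assumption, taken from the example above: the keys of evaluation are the zero-based index of the step in the workflow array.

```python
def eval_results_by_step(workflow, evaluation):
    """Pair each evaluation entry with the workflow step it came from.

    Assumes evaluation keys are zero-based positions into the workflow
    array, matching the example above (the eval step is at index 4).
    """
    return {
        int(idx): {"script": workflow[int(idx)].get("script"), **outcome}
        for idx, outcome in evaluation.items()
    }

workflow = [
    {"type": "wait", "duration": 1000},
    {"type": "scroll", "selector": "footer"},
    {"type": "screenshot", "selector": "footer", "id": "footer-ss"},
    {"type": "screenshot", "selector": "viewport", "id": "viewport-ss"},
    {"type": "eval", "script": "console.log('hello world')"},
    {"type": "scroll", "selector": "main"},
]
results = eval_results_by_step(workflow, {"4": {"result": "", "error": ""}})
```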

🧰 Workflow Object Structure

type WorkflowStep = {
  type: "scroll" | "click" | "wait" | "eval" | "screenshot";
  selector?: string;
  duration?: number;
  script?: string;
  id?: string;
};

🧠 Tips

  • Use scroll + wait to reveal dynamic content.
  • Always provide id when capturing multiple screenshots.
  • Eval is useful for DOM manipulation or capturing in-page variables.

๐Ÿช Custom Cookies & User-Agent

You can customize the HTTP request headers by supplying cookies and a custom User-Agent. This is helpful when dealing with authenticated content, geo-targeted pages, or bot detection mechanisms.

🧐 cookies Field

The cookies field allows you to send one or more cookies as part of the scrape request.

🔐 Cookie Object Structure

{
  name: string
  value: string
  path: string
  expires?: number
  httpOnly: boolean
  secure: boolean
  domain: string
  sameSite: 'Strict' | 'Lax' | 'None'
}
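A small helper can keep cookie objects in the documented shape. The defaults below (path "/", Secure, Lax) are illustrative choices for this sketch, not Crawlio requirements:

```python
def make_cookie(name, value, domain, path="/", http_only=True,
                secure=True, same_site="Lax", expires=None):
    """Build one cookie object matching the structure above."""
    if same_site not in ("Strict", "Lax", "None"):
        raise ValueError("sameSite must be 'Strict', 'Lax', or 'None'")
    cookie = {
        "name": name, "value": value, "domain": domain, "path": path,
        "httpOnly": http_only, "secure": secure, "sameSite": same_site,
    }
    if expires is not None:
        cookie["expires"] = expires  # optional numeric field, per the structure above
    return cookie

cookies = [make_cookie("session_id", "abc123xyz", "example.com")]
```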

🧭 userAgent Field

Set a custom User-Agent string to simulate different browsers or devices.

📥 Example

{
  "url": "https://example.com/account",
  "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "cookies": [
    {
      "name": "session_id",
      "value": "abc123xyz",
      "domain": "example.com",
      "path": "/",
      "httpOnly": true,
      "secure": true,
      "sameSite": "Lax"
    }
  ]
}