crawlio

🕷️ Scrape Endpoint

Initiate a one-time scrape of a specific webpage. This endpoint allows fine-grained control over what content is retrieved and how it's formatted.

🧰 Using with SDKs

Prefer code over curl? Crawlio offers official SDKs for seamless integration with your stack:

📖 View full usage docs: 👉 Node.js SDK Docs 👉 Python SDK Docs

We are working on more extensive documentation for our SDKs. Thanks for your patience!

Cost

| Name   | Cost | Type   |
| ------ | ---- | ------ |
| Scrape | 1    | Scrape |

Using with REST API

🧪 POST /scrape

curl -X POST https://crawlio.xyz/api/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/article",
    "exclude": ["footer", ".ads"],
    "includeOnly": [".article-body"],
    "markdown": true,
    "returnUrls": true
  }'

📥 Request

Endpoint:

POST https://crawlio.xyz/api/scrape

Headers:

Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

Request Body Parameters:

| Field | Type | Required | Description |
| ----- | ---- | -------- | ----------- |
| url | string | ✅ Yes | The full URL of the webpage to be scraped. |
| exclude | array of strings | ❌ No | CSS selectors of elements to remove from the scraped content. |
| includeOnly | array of strings | ❌ No | CSS selectors to restrict scraping to only these elements. |
| markdown | boolean | ✅ Yes | Whether to return the content as Markdown in addition to raw HTML. |
| returnUrls | boolean | ✅ Yes | Whether to return a list of all URLs found on the page. |
| workflow | array of objects | ❌ No | A sequence of browser actions to execute before scraping begins. |
| userAgent | string | ❌ No | User-Agent string to simulate different browsers or devices. |
| cookies | array of objects | ❌ No | One or more cookies to send as part of the scrape request. |

🧾 Example Request

{
  "url": "https://example.com/blog/article",
  "exclude": ["footer", ".ads"],
  "includeOnly": [".article-body"],
  "markdown": true,
  "returnUrls": true
}
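For reference, the same request can be assembled with Python's standard library alone. This is a minimal sketch with no SDK dependency, using only the endpoint and headers documented above; actually sending the request is left commented out:

```python
import json
import urllib.request

API_URL = "https://crawlio.xyz/api/scrape"

def build_scrape_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Assemble a POST /scrape request from the body fields documented above."""
    if "url" not in payload:
        raise ValueError("'url' is required")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_scrape_request("YOUR_API_KEY", {
    "url": "https://example.com/blog/article",
    "exclude": ["footer", ".ads"],
    "includeOnly": [".article-body"],
    "markdown": True,
    "returnUrls": True,
})
# urllib.request.urlopen(req) would send it; response handling is omitted here.
```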

📤 Response

On success, you'll receive a 200 OK response with the following JSON structure:

| Field | Type | Description |
| ----- | ---- | ----------- |
| jobId | string | Unique ID of the scraping job. |
| url | string | The target URL that was scraped. |
| urls | array of strings | List of URLs found on the page (if returnUrls is true). |
| html | string | The full HTML content after processing. |
| markdown | string | Markdown version of the content (if markdown is true). |
| meta | object | Metadata such as generator, viewport size, etc. |
| baseUrl | string | Base URL used for relative links. |
| screenshots | object | Captured screenshots if defined in workflow steps. |
| evaluation | object | Results or errors from custom eval scripts in workflow. |

📦 Example Response

{
  "jobId": "abc123",
  "url": "https://example.com/blog/article",
  "urls": [
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "html": "<html>...</html>",
  "markdown": "# Article Title\n\nContent goes here...",
  "meta": {
    "generator": "CrawlioBot v1.2",
    "viewport": "1280x720"
  },
  "baseUrl": "https://example.com"
}
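When consuming the response in code, note that several fields are conditional. The sketch below assumes optional fields are simply absent when the corresponding request option was off (the docs above don't spell this out), so it reads them defensively:

```python
def summarize_response(resp: dict) -> dict:
    """Extract the always-present fields and read conditional ones defensively."""
    return {
        "jobId": resp["jobId"],
        "url": resp["url"],
        "links_found": len(resp.get("urls", [])),    # only if returnUrls was true
        "has_markdown": "markdown" in resp,          # only if markdown was true
        "screenshots": resp.get("screenshots", {}),  # only if workflow captured any
    }

# The example response from above, minus meta/baseUrl for brevity.
example = {
    "jobId": "abc123",
    "url": "https://example.com/blog/article",
    "urls": ["https://example.com/about", "https://example.com/contact"],
    "html": "<html>...</html>",
    "markdown": "# Article Title\n\nContent goes here...",
}
summary = summarize_response(example)
```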

What and Why?

The Scrape feature is designed for extracting content from a single webpage in a structured, flexible way. It's perfect when you want to capture data from a specific URL without crawling additional pages.

You can fine-tune the scrape with options like:

  • Include Only: Limit output to specific sections of the page (e.g., just the article body).
  • Exclude Elements: Remove unwanted parts such as ads, navigation bars, or footers.
  • Markdown Output: Convert the content into clean, readable Markdown, great for storage, readability, or further processing.
  • Return URLs: Get a list of all internal/external links found on the page for further crawling or indexing.

This feature is ideal for tasks like:

  • Blog or news article extraction
  • Product detail scraping
  • Content archiving
  • Clean page capture for summarization or AI processing

You can initiate the scrape using the /scrape endpoint through any SDK or directly via the REST API.


⚙️ Workflow Feature (Advanced Page Interaction)

The workflow field enables a powerful way to control browser behavior before scraping. Define a sequence of interactions, such as scrolling, clicking, waiting, or evaluating JavaScript, to ensure the page is in the right state before content extraction begins.

🧠 Supported Actions

| Type | Description |
| ---- | ----------- |
| wait | Pause execution for a specified time (in milliseconds). |
| scroll | Scroll to a DOM element using a CSS selector. |
| click | Simulate a click on an element. |
| eval | Execute custom JavaScript in the page context. |
| screenshot | Capture an image of a specific element or the viewport. |

📋 Example Workflow

"workflow": [
  { "type": "wait", "duration": 1000 },
  { "type": "scroll", "selector": "footer" },
  { "type": "screenshot", "selector": "footer", "id": "footer-ss" },
  { "type": "screenshot", "selector": "viewport", "id": "viewport-ss" },
  { "type": "eval", "script": "console.log('hello world')" },
  { "type": "scroll", "selector": "main" }
]
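It can help to sanity-check a workflow client-side before sending it. This is a hedged sketch: the required keys per action type are inferred from the table and example on this page, and the server's actual validation rules may be stricter:

```python
# Required keys per step type, inferred from this page's examples.
REQUIRED_KEYS = {
    "wait": {"duration"},
    "scroll": {"selector"},
    "click": {"selector"},
    "eval": {"script"},
    "screenshot": {"selector", "id"},
}

def validate_workflow(steps):
    """Raise ValueError on the first malformed step."""
    for i, step in enumerate(steps):
        kind = step.get("type")
        if kind not in REQUIRED_KEYS:
            raise ValueError(f"step {i}: unknown type {kind!r}")
        missing = REQUIRED_KEYS[kind] - step.keys()
        if missing:
            raise ValueError(f"step {i}: missing keys {sorted(missing)}")

validate_workflow([
    {"type": "wait", "duration": 1000},
    {"type": "scroll", "selector": "footer"},
    {"type": "screenshot", "selector": "footer", "id": "footer-ss"},
])  # passes silently
```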

📸 Screenshot Output

Each screenshot is returned in the screenshots object:

"screenshots": {
  "footer-ss": "https://cdn.example.com/screens/abc123.png",
  "viewport-ss": "https://cdn.example.com/screens/def456.png"
}

🧪 Eval Result

Custom JavaScript evaluation results are keyed by the step's zero-based position in the workflow array (the eval step in the example above sits at index 4):

"evaluation": {
  "4": {
    "result": "",
    "error": ""
  }
}
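A small helper can pair each evaluation entry back with the step that produced it. One assumption, taken from the example above: the keys of evaluation are the zero-based index of the step in the workflow array.

```python
def eval_results_by_step(workflow, evaluation):
    """Pair each evaluation entry with the workflow step it came from.

    Assumes evaluation keys are zero-based positions into the workflow
    array, matching the example above (the eval step is at index 4).
    """
    return {
        int(idx): {"script": workflow[int(idx)].get("script"), **outcome}
        for idx, outcome in evaluation.items()
    }

workflow = [
    {"type": "wait", "duration": 1000},
    {"type": "scroll", "selector": "footer"},
    {"type": "screenshot", "selector": "footer", "id": "footer-ss"},
    {"type": "screenshot", "selector": "viewport", "id": "viewport-ss"},
    {"type": "eval", "script": "console.log('hello world')"},
    {"type": "scroll", "selector": "main"},
]
results = eval_results_by_step(workflow, {"4": {"result": "", "error": ""}})
```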

🧰 Workflow Object Structure

type WorkflowStep = {
  type: "scroll" | "click" | "wait" | "eval" | "screenshot";
  selector?: string;
  duration?: number;
  script?: string;
  id?: string;
};

🧠 Tips

  • Use scroll + wait to reveal dynamic content.
  • Always provide id when capturing multiple screenshots.
  • Eval is useful for DOM manipulation or capturing in-page variables.

๐Ÿช Custom Cookies & User-Agent

You can customize the HTTP request headers by supplying cookies and a custom User-Agent. This is helpful when dealing with authenticated content, geo-targeted pages, or bot detection mechanisms.

🧐 cookies Field

The cookies field allows you to send one or more cookies as part of the scrape request.

🔐 Cookie Object Structure

{
  name: string
  value: string
  path: string
  expires?: number
  httpOnly: boolean
  secure: boolean
  domain: string
  sameSite: 'Strict' | 'Lax' | 'None'
}
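A small helper can keep cookie objects in the documented shape. The defaults below (path "/", Secure, Lax) are illustrative choices for this sketch, not Crawlio requirements:

```python
def make_cookie(name, value, domain, path="/", http_only=True,
                secure=True, same_site="Lax", expires=None):
    """Build one cookie object matching the structure above."""
    if same_site not in ("Strict", "Lax", "None"):
        raise ValueError("sameSite must be 'Strict', 'Lax', or 'None'")
    cookie = {
        "name": name, "value": value, "domain": domain, "path": path,
        "httpOnly": http_only, "secure": secure, "sameSite": same_site,
    }
    if expires is not None:
        cookie["expires"] = expires  # optional numeric field, per the structure above
    return cookie

cookies = [make_cookie("session_id", "abc123xyz", "example.com")]
```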

🧭 userAgent Field

Set a custom User-Agent string to simulate different browsers or devices.

📥 Example

{
  "url": "https://example.com/account",
  "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "cookies": [
    {
      "name": "session_id",
      "value": "abc123xyz",
      "domain": "example.com",
      "path": "/",
      "httpOnly": true,
      "secure": true,
      "sameSite": "Lax"
    }
  ]
}