🕷️ Scrape Endpoint
Initiate a one-time scrape of a specific webpage. This endpoint allows fine-grained control over what content is retrieved and how it's formatted.
🧰 Using with SDKs
Prefer code over curl? Crawlio offers official SDKs for seamless integration with your stack:
- Node.js SDK (npm): perfect for backend automation, agents, and JS projects.
- Python SDK (PyPI): ideal for data science, AI/ML workflows, and scripting.
📘 View full usage docs: Node.js SDK Docs · Python SDK Docs
We are working on extensive documentation for our SDKs. Thanks for your patience!
Cost
Name | Cost | Type |
---|---|---|
Scrape | 1 | Scrape |
Using with REST API
🧪 POST /scrape
📥 Request
Endpoint:
POST https://crawlio.xyz/api/scrape
Headers:
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json
Request Body Parameters:
Field | Type | Required | Description |
---|---|---|---|
url | string | ✅ Yes | The full URL of the webpage to be scraped. |
exclude | array of strings | ❌ No | CSS selectors of elements to remove from the scraped content. |
includeOnly | array of strings | ❌ No | CSS selectors to restrict scraping to only these elements. |
markdown | boolean | ✅ Yes | Whether to return the content as Markdown in addition to raw HTML. |
returnUrls | boolean | ✅ Yes | Whether to return a list of all URLs found on the page. |
workflow | array of objects | ❌ No | A sequence of browser actions to execute before scraping begins. |
userAgent | string | ❌ No | User-Agent string to simulate different browsers or devices. |
cookies | array of objects | ❌ No | One or more cookies to send as part of the scrape request. |
🧾 Example Request
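A minimal request body using the parameters documented above; the URL and selectors are illustrative:

```json
{
  "url": "https://example.com/article",
  "exclude": ["nav", ".ads", "footer"],
  "markdown": true,
  "returnUrls": true
}
```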
📤 Response
On success, you'll receive a 200 OK response with the following JSON structure:
Field | Type | Description |
---|---|---|
jobId | string | Unique ID of the scraping job. |
url | string | The target URL that was scraped. |
urls | array of strings | List of URLs found on the page (if returnUrls is true). |
html | string | The full HTML content after processing. |
markdown | string | Markdown version of the content (if markdown is true). |
meta | object | Metadata such as generator, viewport size, etc. |
baseUrl | string | Base URL used for relative links. |
screenshots | object | Captured screenshots if defined in workflow steps. |
evaluation | object | Results or errors from custom eval scripts in workflow. |
📦 Example Response
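A sketch of the success payload based on the field table above; values are abbreviated and illustrative:

```json
{
  "jobId": "job_abc123",
  "url": "https://example.com/article",
  "urls": ["https://example.com/about", "https://example.com/contact"],
  "html": "<html>...</html>",
  "markdown": "# Article Title\n\nArticle body...",
  "meta": { "generator": "WordPress", "viewport": "width=device-width" },
  "baseUrl": "https://example.com",
  "screenshots": {},
  "evaluation": {}
}
```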
What and Why?
The Scrape feature is designed for extracting content from a single webpage in a structured, flexible way. It's perfect when you want to capture data from a specific URL without crawling additional pages.
You can fine-tune the scrape with options like:
- Include Only: Limit output to specific sections of the page (e.g., just the article body).
- Exclude Elements: Remove unwanted parts such as ads, navigation bars, or footers.
- Markdown Output: Convert the content into clean, readable Markdown, great for storage, readability, or further processing.
- Return URLs: Get a list of all internal/external links found on the page for further crawling or indexing.
This feature is ideal for tasks like:
- Blog or news article extraction
- Product detail scraping
- Content archiving
- Clean page capture for summarization or AI processing
You can initiate the scrape using the /scrape endpoint through any SDK or directly via the REST API.
⚙️ Workflow Feature (Advanced Page Interaction)
The workflow field enables a powerful way to control browser behavior before scraping. Define a sequence of interactions (scrolling, clicking, waiting, or evaluating JavaScript) to ensure the page is in the right state before content extraction begins.
🔧 Supported Actions
Type | Description |
---|---|
wait | Pause execution for a specified time (in milliseconds). |
scroll | Scroll to a DOM element using a CSS selector. |
click | Simulate a click on an element. |
eval | Execute custom JavaScript in the page context. |
screenshot | Capture an image of a specific element or the viewport. |
📜 Example Workflow
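A sketch covering each supported action type. The type values come from the table above; the other key names (selector, time, script, id) are assumptions, since this section doesn't spell out the exact object shape:

```json
{
  "url": "https://example.com/products",
  "markdown": true,
  "returnUrls": false,
  "workflow": [
    { "type": "scroll", "selector": "#load-more" },
    { "type": "wait", "time": 1500 },
    { "type": "click", "selector": "#load-more" },
    { "type": "eval", "script": "document.querySelectorAll('.product').length" },
    { "type": "screenshot", "id": "product-grid", "selector": ".product-grid" }
  ]
}
```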
📸 Screenshot Output
Each screenshot is returned in the screenshots object:
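Assuming each capture is keyed by the id from its workflow step and returned as base64 image data (the encoding isn't specified in this section), the object might look like:

```json
{
  "screenshots": {
    "product-grid": "iVBORw0KGgoAAAANSUhEUgAA..."
  }
}
```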
🧪 Eval Result
Custom JavaScript evaluation results are indexed by step number:
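A sketch assuming zero-based indexing, matching the workflow example above (the eval action sits at index 3); whether values are returned raw or wrapped in a result/error object isn't specified here:

```json
{
  "evaluation": {
    "3": 24
  }
}
```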
🧰 Workflow Object Structure
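A field-by-field sketch of a single workflow step, consistent with the examples above; only type is confirmed by this section, the remaining key names are assumptions:

```json
{
  "type": "wait | scroll | click | eval | screenshot",
  "time": "number (ms), used by wait",
  "selector": "string (CSS selector), used by scroll, click, and screenshot",
  "script": "string (JavaScript), used by eval",
  "id": "string, labels a screenshot in the screenshots output"
}
```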
🧠 Tips
- Use scroll + wait to reveal dynamic content.
- Always provide id when capturing multiple screenshots.
- Eval is useful for DOM manipulation or capturing in-page variables.
🍪 Custom Cookies & User-Agent
You can customize the HTTP request headers by supplying cookies and a custom User-Agent. This is helpful when dealing with authenticated content, geo-targeted pages, or bot detection mechanisms.
🔧 cookies Field
The cookies field allows you to send one or more cookies as part of the scrape request.
📋 Cookie Object Structure
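The cookie fields aren't enumerated in this section; a sketch assuming the common name/value/domain/path shape used by most browser-automation tools:

```json
{
  "name": "session_id",
  "value": "abc123",
  "domain": ".example.com",
  "path": "/"
}
```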
🧭 userAgent Field
Set a custom User-Agent string to simulate different browsers or devices.
📥 Example
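Putting both together: a request that sends one cookie and a custom User-Agent (cookie keys as assumed above; all values illustrative):

```json
{
  "url": "https://example.com/dashboard",
  "markdown": true,
  "returnUrls": false,
  "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
  "cookies": [
    { "name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/" }
  ]
}
```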
🧩 Features
Crawlio offers a flexible set of high-level features to help you extract structured content from websites at scale. Whether you're scraping individual pages, crawling entire domains, or running batch jobs, Crawlio is designed to give you full control over your data extraction workflows.
📦 Batch Scrape
Scrape multiple webpages in a single batch request. This endpoint is ideal for bulk extraction jobs where you need to process multiple URLs with shared options.