What is a Scraper Workflow?
A Scraper Workflow is a specialized workflow step that extracts data from web pages. It visits a URL, pulls out specific pieces of information (like titles, prices, or descriptions), and passes that data to the next step in your workflow.
This is perfect for research-based content workflows where you want to:
- Extract product information for comparison articles
- Gather search results to inform your content
- Pull data from competitor pages for analysis
- Collect statistics or facts from reference sites
Key features:
- Flexible: Use the free scraper or ScrapingBee for advanced features
- Fast: Runs synchronously within your workflow
- Structured: Returns organized data your writing steps can use
How It Works
A scraper step is added to your multi-step workflow. Here's the flow:
1. Set the URL: Tell the scraper which page to visit. You can include your keyword in the URL.
2. Define selectors: Use CSS selectors to specify which parts of the page you want (titles, prices, links, etc.).
3. Scrape: The scraper visits the page and pulls out the data you specified.
4. Pass data on: The extracted data (as structured JSON) is sent to your next workflow step.
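For intuition, the flow above can be sketched in a few lines of Python. This is an illustrative sketch, not the product's actual code: the fetch is stubbed out with a static HTML string, and `ClassTextExtractor` / `run_scraper_step` are invented names. It only matches elements by class, which is enough to show "apply selectors, emit JSON":

```python
import json
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given class."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.depth = 0          # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.cls in classes:
            self.depth += 1
            self.results.append("")   # start a new capture slot
        elif self.depth:
            self.depth += 1           # nested tag inside a match

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

def run_scraper_step(html, selectors):
    """selectors maps a label to a class name; returns the step's JSON output."""
    scraped = {}
    for label, cls in selectors.items():
        parser = ClassTextExtractor(cls)
        parser.feed(html)
        scraped[label] = parser.results
    return json.dumps({"scraped": scraped})

# Stand-in for a fetched search-results page:
html = ('<div><h3 class="product-title">Laptop 1</h3>'
        '<h3 class="product-title">Laptop 2</h3></div>')
print(run_scraper_step(html, {"product_names": "product-title"}))
# {"scraped": {"product_names": ["Laptop 1", "Laptop 2"]}}
```

A real scraper step also fetches the URL and supports full CSS selectors; the shape of the output (labels mapping to extracted values) is the part that carries over.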
You can add multiple scraper steps in one workflow to gather data from different websites. For example: Scraper 1 (Amazon) → Scraper 2 (Best Buy) → Writer. The writing step receives data from both scrapers.
Setting Up a Scraper Workflow
Step 1: Add a Workflow Step
- Create or edit a workflow
- Switch to the Workflow view and click + Add Step
- From the provider dropdown, select Web Scraper
- The scraper configuration panel will appear
Step 2: Configure the URL
Enter the URL you want to scrape. You can include {{keyword}} to dynamically insert your job's keyword into the URL.
https://example.com/search?q={{keyword}}
For keyword "new homes for sale chicago":
→ https://example.com/search?q=new%20homes%20for%20sale%20chicago
Keyword Modifiers
You can transform the keyword before inserting it into the URL:
| Modifier | Example | Result |
|---|---|---|
| {{keyword}} | new homes for sale chicago | new homes for sale chicago |
| {{keyword:slug}} | new homes for sale chicago | new-homes-for-sale-chicago |
| {{keyword:first}} | new homes for sale chicago | new |
| {{keyword:last}} | new homes for sale chicago | chicago |
| {{keyword:word:2}} | new homes for sale chicago | homes |
| {{keyword:words:2-4}} | new homes for sale chicago | homes for sale |
| {{keyword:from:2}} | new homes for sale chicago | homes for sale chicago |
| {{keyword:to:3}} | new homes for sale chicago | new homes for |
| {{keyword:notlast:1}} | new homes for sale chicago | new homes for sale |
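A hedged sketch of how these modifiers could be implemented (the product's exact parsing rules may differ; `apply_modifier` is an invented name). It also shows the percent-encoding applied when the keyword is inserted into a URL:

```python
from urllib.parse import quote

def apply_modifier(keyword, modifier=""):
    """Apply one keyword modifier. Word positions are 1-based, as in the table."""
    words = keyword.split()
    if modifier == "":
        return keyword
    if modifier == "slug":
        return "-".join(words).lower()
    if modifier == "first":
        return words[0]
    if modifier == "last":
        return words[-1]
    name, _, arg = modifier.partition(":")
    if name == "word":                       # single word by position
        return words[int(arg) - 1]
    if name == "words":                      # inclusive range "A-B"
        a, b = map(int, arg.split("-"))
        return " ".join(words[a - 1:b])
    if name == "from":                       # from position N to the end
        return " ".join(words[int(arg) - 1:])
    if name == "to":                         # first N words
        return " ".join(words[:int(arg)])
    if name == "notlast":                    # drop the last N words
        return " ".join(words[:-int(arg)])
    raise ValueError(f"unknown modifier: {modifier}")

kw = "new homes for sale chicago"
print(apply_modifier(kw, "slug"))        # new-homes-for-sale-chicago
print(apply_modifier(kw, "words:2-4"))   # homes for sale
# URL insertion percent-encodes the plain keyword:
print("https://example.com/search?q=" + quote(kw))
# https://example.com/search?q=new%20homes%20for%20sale%20chicago
```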
Scraper Settings
After adding a scraper step, you can configure the scraping engine and options in the Settings panel.
Scraper Provider
Choose which scraping engine to use:
| Provider | Description |
|---|---|
| Free Scraper | No API key required. Limited to 20 requests per hour. Best for simple, static HTML pages. |
| ScrapingBee | Requires a ScrapingBee API key (add in Settings). Unlimited requests, JavaScript rendering, proxy rotation, and CAPTCHA handling. |
ScrapingBee Options
When using ScrapingBee, you can enable additional options:
- JavaScript Rendering: Executes JavaScript on the page before extracting content. Required for sites that load content dynamically (React, Vue, AJAX).
- Proxy Type: Choose how requests are routed:
- None: Standard requests (fastest)
- Premium: Residential proxies for better success rates
- Stealth: Advanced anti-detection (automatically enables JS rendering)
- Own Proxy: Use your own proxy server
Use ScrapingBee when:
- The target site blocks requests or returns CAPTCHAs
- Content is loaded by JavaScript
- You need to scrape more than 20 pages per hour
- You need reliable, consistent results
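Under the hood, a ScrapingBee request is a single HTTP GET with the options passed as query parameters. The parameter names below (`api_key`, `url`, `render_js`, `premium_proxy`, `stealth_proxy`) follow the ScrapingBee HTTP API as documented at the time of writing; treat them as assumptions and check the current docs. This sketch only builds the request URL, it does not send it:

```python
from urllib.parse import urlencode

def build_scrapingbee_url(api_key, target_url, render_js=False,
                          premium_proxy=False, stealth_proxy=False):
    """Assemble a ScrapingBee API request URL (illustrative helper name)."""
    params = {"api_key": api_key, "url": target_url}
    if render_js:
        params["render_js"] = "true"          # execute JS before extraction
    if premium_proxy:
        params["premium_proxy"] = "true"      # residential proxies
    if stealth_proxy:
        params["stealth_proxy"] = "true"      # stealth implies JS rendering
    return "https://app.scrapingbee.com/api/v1/?" + urlencode(params)

print(build_scrapingbee_url("MY_KEY", "https://example.com", render_js=True))
```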
Configuring Data Selectors
Data selectors tell the scraper what to extract from the page. Each selector has these options:
Selector Options
- Label: A name for this data (e.g., "product_titles", "prices")
- Selector Type: Choose a preset or enter a custom CSS selector
- Attribute: Extract an attribute (like href or src) instead of text
- Multiple: Extract all matches instead of just the first one
Preset Selectors
Publish Owl provides 6 preset selectors that work on most websites:
| Preset | What It Extracts |
|---|---|
| title | Page title from the title tag, h1, or meta tags |
| content | Main content from article, main, or content divs |
| description | Meta description from meta tags |
| price | Prices from common price elements |
| images | Image URLs from img tags |
| links | Links from anchor tags |
How to Find CSS Selectors
CSS selectors are like addresses that tell the scraper where to find specific content on a page. Here's how to find them:
Using Browser Developer Tools
- Open the web page you want to scrape in Chrome, Firefox, or Edge
- Right-click on the element you want to extract (e.g., a product title)
- Click "Inspect" or "Inspect Element"
- The Developer Tools panel will open, highlighting the element's HTML
- Look at the element's tag, class, and ID attributes
Reading the HTML
When you inspect an element, you'll see something like:
<h2 class="product-title">MacBook Pro 16"</h2>
Here the tag is h2 and the class is product-title. From this, you can create the selector: .product-title
Common Selector Patterns
| HTML Element | CSS Selector |
|---|---|
| <div class="price"> | .price |
| <div id="main-content"> | #main-content |
| <h2 class="product-title"> | h2.product-title or .product-title |
| <a href="/product"> | a (with attribute: href) |
| <div class="card"><span class="price"> | .card .price |
| <div data-price="99"> | [data-price] |
- .classname: Selects by class (use a dot before the name)
- #idname: Selects by ID (use a hash before the name)
- tagname: Selects by tag (no prefix needed)
- .parent .child: Selects nested elements
- [attribute]: Selects by attribute
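To make the notation concrete, here is a toy matcher for the basic patterns (`.class`, `#id`, `tag`, and a descendant pair like `.card .price`), built on the stdlib HTML parser. Real scrapers use a full CSS engine; this only shows how each pattern maps onto an element's tag, class, and id:

```python
from html.parser import HTMLParser

def matches(simple, tag, attrs):
    """Does one simple selector match an element?"""
    if simple.startswith("."):
        return simple[1:] in attrs.get("class", "").split()
    if simple.startswith("#"):
        return attrs.get("id") == simple[1:]
    return tag == simple

class Selector(HTMLParser):
    """Toy selector engine: supports the descendant combinator only."""
    def __init__(self, selector):
        super().__init__()
        self.parts = selector.split()   # ".card .price" -> [".card", ".price"]
        self.stack = []                 # open elements: (tag, attrs)
        self.capturing = 0              # > 0 while inside a matched element
        self.results = []

    def path_matches(self):
        # Each selector part must match some ancestor, in document order.
        i = 0
        for tag, attrs in self.stack:
            if i < len(self.parts) and matches(self.parts[i], tag, attrs):
                i += 1
        return i == len(self.parts)

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs)))
        if self.capturing:
            self.capturing += 1
        elif self.path_matches():
            self.capturing = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
        if self.capturing:
            self.capturing -= 1

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data

def select(html, selector):
    parser = Selector(selector)
    parser.feed(html)
    return parser.results

html = ('<div class="card"><span class="price">$99</span></div>'
        '<span class="price">$5</span>')
print(select(html, ".card .price"))   # ['$99']
print(select(html, ".price"))         # ['$99', '$5']
```

Note how `.card .price` skips the standalone price: the nested form is exactly what makes a selector "more specific".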
Extracting Attributes vs Text
By default, the scraper extracts the text content of an element. But sometimes you need to extract an attribute value instead (like a link URL or image source).
Common Attributes
| Attribute | Use Case | Example Element |
|---|---|---|
| href | Extract link URLs | <a href="/product">... |
| src | Extract image URLs | <img src="/image.jpg"> |
| content | Extract meta tag values | <meta content="..."> |
| data-* | Extract custom data | <div data-price="99"> |
| alt | Extract image descriptions | <img alt="Product photo"> |
Example: Extracting Links
Selector: .product-card a
Attribute: href
Multiple: Yes
HTML:
<div class="product-card">
  <a href="/products/laptop-1">Laptop 1</a>
</div>
<div class="product-card">
  <a href="/products/laptop-2">Laptop 2</a>
</div>
Result: ["/products/laptop-1", "/products/laptop-2"]
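The same link-extraction example, sketched with the stdlib parser (illustrative only; `CardLinkCollector` is an invented name). When an Attribute is configured, the scraper reads that attribute instead of the element's text:

```python
from html.parser import HTMLParser

class CardLinkCollector(HTMLParser):
    """Collects `attr` from every <a> appearing inside class `cls`."""
    def __init__(self, cls="product-card", attr="href"):
        super().__init__()
        self.cls, self.attr = cls, attr
        self.depth = 0            # > 0 while inside a matching container
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.cls in attrs.get("class", "").split():
            self.depth += 1
            return
        if self.depth:
            self.depth += 1
            if tag == "a" and self.attr in attrs:
                self.results.append(attrs[self.attr])

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

html = ('<div class="product-card"><a href="/products/laptop-1">Laptop 1</a></div>'
        '<div class="product-card"><a href="/products/laptop-2">Laptop 2</a></div>')
parser = CardLinkCollector()
parser.feed(html)
print(parser.results)   # ['/products/laptop-1', '/products/laptop-2']
```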
Single vs Multiple Matches
The Multiple checkbox controls whether you get one result or all matching results.
With Multiple unchecked (first match only):
Selector: .product-title
Result: "MacBook Pro 16\""
With Multiple checked (all matches):
Selector: .product-title
Result: ["MacBook Pro 16\"", "MacBook Air 15\"", "Mac Mini M2"]
Enable Multiple when you are:
- Extracting product listings from a search page
- Getting all prices on a comparison page
- Collecting multiple links or images
- Scraping any repeating elements
Example Configurations
Example 1: Product Research
Extract product information from an e-commerce search page.
URL: https://shop.example.com/search?q={{keyword}}
Selectors:
1. Label: product_names
Selector: .product-card h3
Multiple: Yes
2. Label: prices
Selector: .product-card .price
Multiple: Yes
3. Label: product_links
Selector: .product-card a
Attribute: href
Multiple: Yes
Example 2: Article Research
Extract article titles and summaries from a news site.
URL: https://news.example.com/search/{{keyword:slug}}
Selectors:
1. Label: headlines
Selector: .article-card h2
Multiple: Yes
2. Label: summaries
Selector: .article-card .excerpt
Multiple: Yes
3. Label: dates
Selector: .article-card time
Multiple: Yes
Example 3: Single Page Data
Extract specific data points from a single product page.
URL: https://example.com/product/{{keyword:slug}}
Selectors:
1. Label: title
Selector: h1.product-title
Multiple: No
2. Label: price
Selector: .current-price
Multiple: No
3. Label: description
Selector: .product-description
Multiple: No
4. Label: specs
Selector: .spec-list li
Multiple: Yes
Using Scraped Data in Your Workflow
The scraper outputs structured JSON that automatically passes to the next step. Here's what it looks like:
{
  "url": "https://shop.example.com/search?q=best%20laptops",
  "scraped": {
    "product_names": ["MacBook Pro", "Dell XPS 15", "ThinkPad X1"],
    "prices": ["$2,499", "$1,899", "$1,649"],
    "product_links": ["/products/macbook", "/products/xps", "/products/thinkpad"]
  }
}
In Your Next Step's Prompt
The writing step receives this data automatically. You can reference it in your prompt:
Write a comparison article about the products in the research data. Include the product names, prices, and create a comparison table. The research data is provided above.
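The product handles this injection for you; purely for intuition, here is one plausible way scraped JSON gets folded into the next step's prompt (a sketch, not the actual implementation):

```python
import json

# Output of the scraper step, as shown above:
scraper_output = {
    "url": "https://shop.example.com/search?q=best%20laptops",
    "scraped": {
        "product_names": ["MacBook Pro", "Dell XPS 15"],
        "prices": ["$2,499", "$1,899"],
    },
}

# Prepend the structured data, then the user's instruction:
prompt = (
    "Research data:\n"
    + json.dumps(scraper_output["scraped"], indent=2)
    + "\n\nWrite a comparison article about the products in the research data. "
      "Include the product names, prices, and create a comparison table."
)
print(prompt)
```

Because the data arrives as labeled JSON, the writing model can refer to `product_names` and `prices` directly, which is why descriptive selector labels matter.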
Debugging Tips
Selector Not Finding Anything?
- Double-check the class name spelling (they're case-sensitive)
- Make sure to use . before class names
- Try a simpler selector first (just the class, without nesting)
- The element might be loaded by JavaScript (see limitations below)
Getting Too Much Content?
- Make your selector more specific (e.g., .card .title instead of .title)
- Use Multiple: No if you only need the first match
Enable Raw HTML (Temporarily)
Check "Include raw HTML" to see the actual page HTML. This helps you find the right selectors. Disable it after debugging - large HTML can cause issues.
Limitations
The free scraper cannot extract content that is:
- Loaded by JavaScript after the page loads (AJAX content)
- Behind login pages or paywalls
- Rendered by client-side frameworks (React, Vue, etc.)
- Loaded after scrolling or clicking
If your target site loads content with JavaScript, use ScrapingBee with JavaScript Rendering enabled. This executes JavaScript on the page before extracting content, handling most dynamic sites.
For sites with strong anti-bot protection, enable Stealth Proxy for advanced detection evasion.
Alternative: Anchor Browser
For complex scraping tasks (e.g., sites requiring login, multi-step interactions, or scrolling), consider using the Anchor Browser (b0) step which runs a full browser.
How to Tell If a Site Needs JavaScript
- Right-click the page and select "View Page Source"
- Search for the content you want to extract
- If you can find it in the source, the free scraper will work
- If you can't find it, the content is loaded by JavaScript—use ScrapingBee with JavaScript Rendering
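The "View Page Source" check can be automated: fetch the raw HTML with no JavaScript executed and look for the content you expect. The snippet below keeps the check itself pure so it works on any HTML string (`needs_js` is an invented helper name; the fetched pages here are stand-in samples):

```python
def needs_js(raw_html, expected_text):
    """True if the expected content is missing from the raw page source."""
    return expected_text not in raw_html

# Stand-ins for fetched pages:
static_page = "<html><body><h1>Best Laptops 2024</h1></body></html>"
spa_page = '<html><body><div id="root"></div></body></html>'   # client-rendered shell

print(needs_js(static_page, "Best Laptops"))   # False: free scraper will work
print(needs_js(spa_page, "Best Laptops"))      # True: use JS rendering

# In practice, fetch the raw source first, e.g. with
# urllib.request.urlopen(url).read().decode(), then run the check.
```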
Best Practices
1. Start Simple
Begin with preset selectors. Only use custom CSS selectors when presets don't work.
2. Test Your Selectors
Run a test job with one keyword to verify your selectors work before scaling up.
3. Use Descriptive Labels
Name your selectors clearly (e.g., "product_prices" not "data1") so the writing step understands the data.
4. Be Respectful
Don't scrape sites excessively. Check robots.txt and terms of service before scraping.
5. Position in Workflow
Place scraper steps early in your workflow so writing steps have access to the data. A typical flow: Scraper → Writer → Editor.
6. Use Multiple Scrapers
You can add multiple scraper steps in one workflow to gather data from different sources. Each scraper's output is passed along, giving your writing step richer context.
Scraping in News Discovery
The News Discovery feature also uses scraping when "Fetch full article content" is enabled. When News Discovery finds stories through web searches or RSS feeds, it can optionally scrape the full content from each source URL instead of relying on search result summaries.
How it Works
- News Discovery searches for stories using AI-powered web search or monitors RSS feeds
- When "Fetch full article content" is enabled, it scrapes the full text from discovered URLs
- The scraped content is passed to your workflow steps instead of just the search summary
- You can limit how many URLs are scraped per discovery run (default: 3)
Scraper Options for News Discovery
- Scraper Provider: Free or ScrapingBee (same as workflow scrapers)
- JavaScript Rendering: Enable for sites that load content dynamically
- Proxy Settings: Premium or Stealth proxy for better success rates
- Max URLs to Scrape: Limit how many citation URLs to scrape per search
Per-URL Scraper Overrides
When using URL to Scrape mode (select it from the Keyword column dropdown), you can customize scraper settings for individual URLs. This is useful when scraping multiple sites with different page structures.
Use per-URL overrides when:
- Scraping multiple different websites in one workflow (each has different selectors)
- One site needs JavaScript rendering but others don't
- Testing different selectors for a specific URL
How to Configure Per-URL Overrides
- Add a Web Scraper workflow step to your workflow
- Switch to URL to Scrape mode (click the "Keyword" column header)
- In the Keywords table, you'll see a Selectors column
- Click the button in the Selectors column for any URL row
- Configure custom selectors and settings for that specific URL
Available Override Options
- Custom Selectors: Different CSS selectors for this specific URL
- Scraper Provider: Free, ScrapingBee, or inherit from workflow
- JavaScript Rendering: Enable/disable per URL
- Proxy Settings: None, Premium, Stealth, or Own Proxy per URL
Example Use Case
A real estate content workflow scraping Zillow, Redfin, and Realtor.com: each site has different CSS selectors for prices, addresses, and listing details. Per-URL overrides let you configure site-specific selectors while using a single workflow.
URL 1: https://zillow.com/chicago
Selectors: .list-card-price, .list-card-addr
URL 2: https://redfin.com/city/chicago
Selectors: .HomeCard-price, .HomeCard-address
URL 3: https://realtor.com/realestateandhomes-search/chicago
Selectors: [data-testid="card-price"], [data-testid="card-address"]