Web Scraper Workflow

What is a Scraper Workflow?

A Scraper Workflow is a specialized workflow step that extracts data from web pages. It visits a URL, pulls out specific pieces of information (like titles, prices, or descriptions), and passes that data to the next step in your workflow.

This is perfect for research-based content workflows where you want to:

  • Extract product information for comparison articles
  • Gather search results to inform your content
  • Pull data from competitor pages for analysis
  • Collect statistics or facts from reference sites
Key Benefits:
  • Flexible: Use the free scraper or ScrapingBee for advanced features
  • Fast: Runs synchronously within your workflow
  • Structured: Returns organized data your writing steps can use

How It Works

A scraper step is added to your multi-step workflow. Here's the flow:

Step 1: You provide a URL

Tell the scraper which page to visit. You can include your keyword in the URL.

Step 2: You define what to extract

Use CSS selectors to specify which parts of the page you want (titles, prices, links, etc.)

Step 3: Scraper fetches and extracts

The scraper visits the page and pulls out the data you specified.

Step 4: Data passes to next step

The extracted data (as structured JSON) is sent to your next workflow step.

Tip: Multiple Scrapers

You can add multiple scraper steps in one workflow to gather data from different websites. For example: Scraper 1 (Amazon) → Scraper 2 (Best Buy) → Writer. The writing step receives data from both scrapers.

Setting Up a Scraper Workflow

Step 1: Add a Workflow Step

  1. Create or edit a workflow
  2. Switch to the Workflow view and click + Add Step
  3. From the provider dropdown, select Web Scraper
  4. The scraper configuration panel will appear

Step 2: Configure the URL

Enter the URL you want to scrape. You can include {{keyword}} to dynamically insert your job's keyword into the URL.

https://example.com/search?q={{keyword}}

For keyword "new homes for sale chicago":
→ https://example.com/search?q=new%20homes%20for%20sale%20chicago
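The substitution above can be sketched in a few lines of Python. This is an illustration, not Publish Owl's internal code; it assumes the scraper URL-encodes the keyword (spaces become %20) before inserting it into the template.

```python
from urllib.parse import quote

def build_url(template: str, keyword: str) -> str:
    """Replace the {{keyword}} placeholder with the URL-encoded keyword."""
    return template.replace("{{keyword}}", quote(keyword))

url = build_url("https://example.com/search?q={{keyword}}",
                "new homes for sale chicago")
print(url)  # https://example.com/search?q=new%20homes%20for%20sale%20chicago
```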

Keyword Modifiers

You can transform the keyword before inserting it into the URL:

Modifier               Result (for keyword "new homes for sale chicago")
{{keyword}}            new homes for sale chicago
{{keyword:slug}}       new-homes-for-sale-chicago
{{keyword:first}}      new
{{keyword:last}}       chicago
{{keyword:word:2}}     homes
{{keyword:words:2-4}}  homes for sale
{{keyword:from:2}}     homes for sale chicago
{{keyword:to:3}}       new homes for
{{keyword:notlast:1}}  new homes for sale
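The behavior of these modifiers can be sketched as follows. This is a rough reimplementation for illustration only (the function name is invented, and it assumes 1-based word positions and whitespace-delimited words, which matches the examples above).

```python
def apply_modifier(keyword: str, modifier: str = "") -> str:
    """Sketch of the keyword modifiers; positions are 1-based."""
    words = keyword.split()
    if modifier == "":
        return keyword
    if modifier == "slug":
        return "-".join(words).lower()
    if modifier == "first":
        return words[0]
    if modifier == "last":
        return words[-1]
    op, _, arg = modifier.partition(":")
    if op == "word":        # single word at a 1-based position
        return words[int(arg) - 1]
    if op == "words":       # inclusive 1-based range like "2-4"
        start, end = (int(n) for n in arg.split("-"))
        return " ".join(words[start - 1:end])
    if op == "from":        # from a position to the end
        return " ".join(words[int(arg) - 1:])
    if op == "to":          # from the start up to a position
        return " ".join(words[:int(arg)])
    if op == "notlast":     # drop the last N words
        return " ".join(words[:-int(arg)])
    raise ValueError(f"unknown modifier: {modifier}")
```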

Scraper Settings

After adding a scraper step, you can configure the scraping engine and options in the Settings panel.

Scraper Provider

Choose which scraping engine to use:

Provider      Description
Free Scraper  No API key required. Limited to 20 requests per hour. Best for simple, static HTML pages.
ScrapingBee   Requires a ScrapingBee API key (add it in Settings). Unlimited requests, JavaScript rendering, proxy rotation, and CAPTCHA handling.

ScrapingBee Options

When using ScrapingBee, you can enable additional options:

  • JavaScript Rendering: Executes JavaScript on the page before extracting content. Required for sites that load content dynamically (React, Vue, AJAX).
  • Proxy Type: Choose how requests are routed:
    • None: Standard requests (fastest)
    • Premium: Residential proxies for better success rates
    • Stealth: Advanced anti-detection (automatically enables JS rendering)
    • Own Proxy: Use your own proxy server
When to Use ScrapingBee
  • Target site blocks requests or returns CAPTCHAs
  • Content is loaded by JavaScript
  • You need to scrape more than 20 pages per hour
  • You need reliable, consistent results

Configuring Data Selectors

Data selectors tell the scraper what to extract from the page. Each selector has these options:

Selector Options

  • Label: A name for this data (e.g., "product_titles", "prices")
  • Selector Type: Choose a preset or enter a custom CSS selector
  • Attribute: Extract an attribute (like href or src) instead of text
  • Multiple: Extract all matches instead of just the first one

Preset Selectors

Publish Owl provides 6 preset selectors that work on most websites:

Preset       What It Extracts
title        Page title from the title tag, h1, or meta tags
content      Main content from article, main, or content divs
description  Meta description from meta tags
price        Prices from common price elements
images       Image URLs from img tags
links        Links from anchor tags

How to Find CSS Selectors

CSS selectors are like addresses that tell the scraper where to find specific content on a page. Here's how to find them:

Using Browser Developer Tools

  1. Open the web page you want to scrape in Chrome, Firefox, or Edge
  2. Right-click on the element you want to extract (e.g., a product title)
  3. Click "Inspect" or "Inspect Element"
  4. The Developer Tools panel will open, highlighting the element's HTML
  5. Look at the element's tag, class, and ID attributes

Reading the HTML

When you inspect an element, you'll see something like:

<h2 class="product-title">MacBook Pro 16"</h2>
 ^          ^
 tag        class

From this, you can create the selector: .product-title

Common Selector Patterns

HTML Element                            CSS Selector
<div class="price">                     .price
<div id="main-content">                 #main-content
<h2 class="product-title">              h2.product-title or .product-title
<a href="/product">                     a (with attribute: href)
<div class="card"><span class="price">  .card .price
<div data-price="99">                   [data-price]
Quick Reference:
  • .classname - Selects by class (use a dot before the name)
  • #idname - Selects by ID (use a hash before the name)
  • tagname - Selects by tag (no prefix needed)
  • .parent .child - Selects nested elements
  • [attribute] - Selects by attribute

Extracting Attributes vs Text

By default, the scraper extracts the text content of an element. But sometimes you need to extract an attribute value instead (like a link URL or image source).

Common Attributes

Attribute  Use Case                    Example Element
href       Extract link URLs           <a href="/product">...
src        Extract image URLs          <img src="/image.jpg">
content    Extract meta tag values     <meta content="...">
data-*     Extract custom data         <div data-price="99">
alt        Extract image descriptions  <img alt="Product photo">

Example: Extracting Links

Selector: .product-card a
Attribute: href
Multiple: Yes

HTML:
<div class="product-card">
  <a href="/products/laptop-1">Laptop 1</a>
</div>
<div class="product-card">
  <a href="/products/laptop-2">Laptop 2</a>
</div>

Result: ["/products/laptop-1", "/products/laptop-2"]

Single vs Multiple Matches

The Multiple checkbox controls whether you get one result or all matching results.

Multiple OFF (first match only):
Selector: .product-title

Result: "MacBook Pro 16\""
Multiple ON (all matches):
Selector: .product-title

Result: [
  "MacBook Pro 16\"",
  "MacBook Air 15\"",
  "Mac Mini M2"
]
When to use Multiple:
  • Extracting product listings from a search page
  • Getting all prices on a comparison page
  • Collecting multiple links or images
  • Scraping any repeating elements
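The Multiple option boils down to a simple choice over the list of matches. A minimal sketch (the function name is illustrative):

```python
def pick(matches: list, multiple: bool):
    """Multiple ON returns every match; OFF returns only the first (or None)."""
    if multiple:
        return matches
    return matches[0] if matches else None

titles = ['MacBook Pro 16"', 'MacBook Air 15"', "Mac Mini M2"]
print(pick(titles, multiple=False))  # MacBook Pro 16"
print(pick(titles, multiple=True))   # all three titles
```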

Example Configurations

Example 1: Product Research

Extract product information from an e-commerce search page.

URL: https://shop.example.com/search?q={{keyword}}

Selectors:
  1. Label: product_names
     Selector: .product-card h3
     Multiple: Yes

  2. Label: prices
     Selector: .product-card .price
     Multiple: Yes

  3. Label: product_links
     Selector: .product-card a
     Attribute: href
     Multiple: Yes

Example 2: Article Research

Extract article titles and summaries from a news site.

URL: https://news.example.com/search/{{keyword:slug}}

Selectors:
  1. Label: headlines
     Selector: .article-card h2
     Multiple: Yes

  2. Label: summaries
     Selector: .article-card .excerpt
     Multiple: Yes

  3. Label: dates
     Selector: .article-card time
     Multiple: Yes

Example 3: Single Page Data

Extract specific data points from a single product page.

URL: https://example.com/product/{{keyword:slug}}

Selectors:
  1. Label: title
     Selector: h1.product-title
     Multiple: No

  2. Label: price
     Selector: .current-price
     Multiple: No

  3. Label: description
     Selector: .product-description
     Multiple: No

  4. Label: specs
     Selector: .spec-list li
     Multiple: Yes

Using Scraped Data in Your Workflow

The scraper outputs structured JSON that automatically passes to the next step. Here's what it looks like:

{
  "url": "https://shop.example.com/search?q=best%20laptops",
  "scraped": {
    "product_names": ["MacBook Pro", "Dell XPS 15", "ThinkPad X1"],
    "prices": ["$2,499", "$1,899", "$1,649"],
    "product_links": ["/products/macbook", "/products/xps", "/products/thinkpad"]
  }
}

In Your Next Step's Prompt

The writing step receives this data automatically. You can reference it in your prompt:

Write a comparison article about the products in the research data.
Include the product names, prices, and create a comparison table.
The research data is provided above.
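As an illustration of how a writing step might consume the payload (this is not Publish Owl's exact prompt layout, and to_prompt_context is an invented helper), the scraped JSON can be flattened into labeled context lines:

```python
scraped_payload = {
    "url": "https://shop.example.com/search?q=best%20laptops",
    "scraped": {
        "product_names": ["MacBook Pro", "Dell XPS 15", "ThinkPad X1"],
        "prices": ["$2,499", "$1,899", "$1,649"],
    },
}

def to_prompt_context(payload: dict) -> str:
    """Flatten scraper output into lines a writing prompt can reference."""
    lines = [f"Source: {payload['url']}"]
    for label, values in payload["scraped"].items():
        joined = ", ".join(values) if isinstance(values, list) else values
        lines.append(f"{label}: {joined}")
    return "\n".join(lines)

print(to_prompt_context(scraped_payload))
```

Descriptive selector labels (see Best Practices below) pay off here: they become the field names the writing model sees.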

Debugging Tips

Selector Not Finding Anything?

  • Double-check the class name spelling (they're case-sensitive)
  • Make sure to use . before class names
  • Try a simpler selector first (just the class, without nesting)
  • The element might be loaded by JavaScript (see limitations below)

Getting Too Much Content?

  • Make your selector more specific (e.g., .card .title instead of .title)
  • Use Multiple: No if you only need the first match

Enable Raw HTML (Temporarily)

Check "Include raw HTML" to see the actual page HTML. This helps you find the right selectors. Disable it after debugging, since large HTML payloads can cause issues.

Limitations

Free Scraper: Static HTML Only

The free scraper cannot extract content that is:

  • Loaded by JavaScript after the page loads (AJAX content)
  • Behind login pages or paywalls
  • Rendered by client-side frameworks (React, Vue, etc.)
  • Loaded after scrolling or clicking
ScrapingBee: JavaScript Support

If your target site loads content with JavaScript, use ScrapingBee with JavaScript Rendering enabled. This executes JavaScript on the page before extracting content, handling most dynamic sites.

For sites with strong anti-bot protection, enable Stealth Proxy for advanced detection evasion.

Alternative: Anchor Browser

For complex scraping tasks (e.g., sites requiring login, multi-step interactions, or scrolling), consider using the Anchor Browser (b0) step which runs a full browser.

How to Tell If a Site Needs JavaScript

  1. Right-click the page and select "View Page Source"
  2. Search for the content you want to extract
  3. If you can find it in the source, the free scraper will work
  4. If you can't find it, the content is loaded by JavaScript—use ScrapingBee with JavaScript Rendering
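The view-source check above is just a substring test on the raw HTML. A minimal sketch (visible_in_source is an illustrative name; in practice you would paste the page source from your browser):

```python
def visible_in_source(page_source: str, expected_text: str) -> bool:
    """If the text appears in the raw HTML, the free scraper can extract it;
    if not, the content is likely rendered by JavaScript."""
    return expected_text.lower() in page_source.lower()

static_html = '<h1 class="product-title">MacBook Pro 16"</h1>'
spa_html = '<div id="root"></div><script src="/bundle.js"></script>'

print(visible_in_source(static_html, "MacBook Pro"))  # True
print(visible_in_source(spa_html, "MacBook Pro"))     # False
```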

Best Practices

1. Start Simple

Begin with preset selectors. Only use custom CSS selectors when presets don't work.

2. Test Your Selectors

Run a test job with one keyword to verify your selectors work before scaling up.

3. Use Descriptive Labels

Name your selectors clearly (e.g., "product_prices" not "data1") so the writing step understands the data.

4. Be Respectful

Don't scrape sites excessively. Check robots.txt and terms of service before scraping.

5. Position in Workflow

Place scraper steps early in your workflow so writing steps have access to the data. A typical flow: Scraper → Writer → Editor.

6. Use Multiple Scrapers

You can add multiple scraper steps in one workflow to gather data from different sources. Each scraper's output is passed along, giving your writing step richer context.

Scraping in News Discovery

The News Discovery feature also uses scraping when "Fetch full article content" is enabled. When News Discovery finds stories through web searches or RSS feeds, it can optionally scrape the full content from each source URL instead of relying on search result summaries.

How it Works

  • News Discovery searches for stories using AI-powered web search or monitors RSS feeds
  • When "Fetch full article content" is enabled, it scrapes the full text from discovered URLs
  • The scraped content is passed to your workflow steps instead of just the search summary
  • You can limit how many URLs are scraped per discovery run (default: 3)

Scraper Options for News Discovery

  • Scraper Provider: Free or ScrapingBee (same as workflow scrapers)
  • JavaScript Rendering: Enable for sites that load content dynamically
  • Proxy Settings: Premium or Stealth proxy for better success rates
  • Max URLs to Scrape: Limit how many citation URLs to scrape per search
Tip: Enable full article scraping to give your workflow steps more context about each news story. This results in higher quality, more accurate articles compared to using only search summaries.

Per-URL Scraper Overrides

When using URL to Scrape mode (select it from the Keyword column dropdown), you can customize scraper settings for individual URLs. This is useful when scraping multiple sites with different page structures.

When to Use Per-URL Overrides:
  • Scraping multiple different websites in one workflow (each has different selectors)
  • One site needs JavaScript rendering but others don't
  • Testing different selectors for a specific URL

How to Configure Per-URL Overrides

  1. Add a Web Scraper workflow step to your workflow
  2. Switch to URL to Scrape mode (click the "Keyword" column header)
  3. In the Keywords table, you'll see a Selectors column
  4. Click the button in the Selectors column for any URL row
  5. Configure custom selectors and settings for that specific URL

Available Override Options

  • Custom Selectors: Different CSS selectors for this specific URL
  • Scraper Provider: Free, ScrapingBee, or inherit from workflow
  • JavaScript Rendering: Enable/disable per URL
  • Proxy Settings: None, Premium, Stealth, or Own Proxy per URL

Example Use Case

A real estate workflow scrapes Zillow, Redfin, and Realtor.com. Each site uses different CSS selectors for prices, addresses, and listing details, so per-URL overrides let you configure site-specific selectors while using a single workflow.

URL 1: https://zillow.com/chicago
  Selectors: .list-card-price, .list-card-addr

URL 2: https://redfin.com/city/chicago
  Selectors: .HomeCard-price, .HomeCard-address

URL 3: https://realtor.com/realestateandhomes-search/chicago
  Selectors: [data-testid="card-price"], [data-testid="card-address"]