Introduction

When you build a web scraper, choosing between BeautifulSoup and Scrapy often feels like a choice between simplicity and power. BeautifulSoup is lightweight and intuitive, perfect for quick tasks. Scrapy is a full-featured framework designed for large-scale, production-ready scraping. So how do they stack up in a real-world project?

In this article, I’ll explore this question through a BBC Weather scraper that extracts 14-day forecasts with hourly data, which is around 330 data points per location. The project implements both engines using a dual-architecture approach, so we can compare their performance directly while sharing core parsing logic.

This comparison is especially helpful if you’re building a pet project or a single-purpose scraper and wondering whether Scrapy is overkill. The goal isn’t just to see which is faster but to understand when each tool actually makes sense. Spoiler: for most single-location scraping tasks, BeautifulSoup is more than enough. Scrapy’s advantages appear when you need to scale. View the complete project here.

The Challenge: Extracting Weather Data from BBC

BBC Weather presents an interesting scraping challenge. The site relies heavily on JavaScript rendering, so simple HTTP requests won’t work – you need a real browser. Playwright handles this perfectly, loading the page and executing all scripts.

But here’s the key discovery: BBC Weather doesn’t render forecast data into HTML elements. Instead, all 14 days of forecasts are embedded as JSON directly inside a <script> tag. This means we don’t need complex CSS selectors or fragile HTML parsing. We just need to find the right script tag and extract the JSON.

The JSON structure looks like this:

{
  "options": {
    "location_id": "2643743",
    "day": "none",
    "locale": "en"
  },
  "data": {
    "forecasts": [
      {
        "localDate": "2026-02-05",
        "timeslot": "14:00",
        "temperatureC": 8,
        "windSpeedKph": 24,
        "humidity": 76,
        "pressure": 1015
        // ... more fields
      }
      // ... ~330 hourly reports
    ]
  }
}
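Once the raw string has been recovered, `json.loads()` turns it into plain dictionaries. A minimal sketch using the structure above (the sample payload mirrors the field names shown, not a live response):

```python
import json

# Sample payload mirroring the embedded BBC Weather JSON above
raw = '''{
    "options": {"location_id": "2643743", "day": "none", "locale": "en"},
    "data": {
        "forecasts": [
            {"localDate": "2026-02-05", "timeslot": "14:00",
             "temperatureC": 8, "windSpeedKph": 24,
             "humidity": 76, "pressure": 1015}
        ]
    }
}'''

payload = json.loads(raw)
forecasts = payload["data"]["forecasts"]
print(forecasts[0]["temperatureC"])  # → 8
```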

Because the JSON is embedded inside JavaScript (not as a clean JSON response), we can’t simply call json.loads(). We first need to extract the raw JSON string safely. Here’s the extraction algorithm:

def _extract_json_from_script(self, script_content: str) -> Optional[str]:
    # Find the JSON start by looking for characteristic key patterns
    start_patterns = ['{"options":', '{"data":']
    start_idx = -1

    for pattern in start_patterns:
        idx = script_content.find(pattern)
        if idx >= 0:
            start_idx = idx
            break

    if start_idx < 0:
        return None

    # Use brace matching to find the JSON end
    brace_count = 0
    in_string = False
    escape = False
    end_idx = start_idx

    for i in range(start_idx, len(script_content)):
        char = script_content[i]

        # Skip the character following a backslash (escape sequence, e.g. \" inside strings)
        if escape:
            escape = False
            continue

        if char == "\\":
            escape = True
            continue

        # Track whether we are inside a string literal
        if char == '"':
            in_string = not in_string
            continue

        # Count braces only outside strings
        if not in_string:
            if char == "{":
                brace_count += 1
            elif char == "}":
                brace_count -= 1
                if brace_count == 0:
                    end_idx = i + 1
                    break

    if end_idx > start_idx and brace_count == 0:
        return script_content[start_idx:end_idx]

    return None

This approach is robust and resistant to CSS/HTML changes – as long as BBC keeps the JSON structure, our scraper continues working. Both BeautifulSoup and Scrapy implementations use this exact same parser, which brings us to the project’s architecture.
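To see the extractor at work, here is a standalone version of the same algorithm run against a simulated script tag (the surrounding JavaScript is invented for illustration):

```python
import json

def extract_json_from_script(script_content):
    """Find the embedded JSON object via pattern search plus brace matching."""
    start_idx = -1
    for pattern in ('{"options":', '{"data":'):
        idx = script_content.find(pattern)
        if idx >= 0:
            start_idx = idx
            break
    if start_idx < 0:
        return None

    brace_count, in_string, escape = 0, False, False
    end_idx = start_idx
    for i in range(start_idx, len(script_content)):
        char = script_content[i]
        if escape:                # skip the character after a backslash
            escape = False
            continue
        if char == "\\":
            escape = True
            continue
        if char == '"':           # toggle string state
            in_string = not in_string
            continue
        if not in_string:
            if char == "{":
                brace_count += 1
            elif char == "}":
                brace_count -= 1
                if brace_count == 0:
                    end_idx = i + 1
                    break
    if end_idx > start_idx and brace_count == 0:
        return script_content[start_idx:end_idx]
    return None

# Simulated script content: JSON wrapped in arbitrary JavaScript
script = 'window.__data = {"options": {"locale": "en"}, "data": {"forecasts": []}}; init();'
extracted = extract_json_from_script(script)
print(json.loads(extracted)["options"]["locale"])  # → en
```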

Dual-Engine Architecture

The project’s architecture is built around a single guiding principle: keep the scraping engine separate from the parsing logic. Since both BeautifulSoup and Scrapy need to extract the same JSON from the same HTML, they share the exact same BBCWeatherParser class. Each engine is responsible only for fetching the page content, while the shared parser handles parsing and data normalization. This eliminates code duplication and ensures consistent data extraction regardless of which engine you choose.
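The shared interface can be sketched as an abstract base class. This is an illustrative reconstruction with simplified stand-ins for the project’s data models; the real signatures may differ:

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Simplified stand-ins for the project's data models (illustrative only)
@dataclass
class Location:
    location_id: str
    name: str

@dataclass
class WeatherData:
    location_id: str = ""
    location_name: str = ""

class BaseScraper(ABC):
    """Common interface both engines implement; only fetching differs."""

    @abstractmethod
    async def scrape(self, location: Location) -> WeatherData:
        """Fetch the page and return parsed weather data."""

# A dummy engine showing how a concrete scraper plugs in
class DummyScraper(BaseScraper):
    async def scrape(self, location: Location) -> WeatherData:
        return WeatherData(location.location_id, location.name)

result = asyncio.run(DummyScraper().scrape(Location("2643743", "London")))
print(result.location_name)  # → London
```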

Engine selection is handled via a factory pattern. Both implementations inherit from a shared BaseScraper abstract class, which defines the common interface and keeps the factory logic clean. Instead of hardcoding which scraper to use, the CLI accepts an --engine flag that creates the appropriate scraper instance at runtime:

def create_scraper(
    engine: Literal["bs4", "scrapy"] = "bs4",
    storage_format: str = "json",
    output_filename: Optional[str] = None,
) -> BaseScraper:
    if engine == "bs4":
        return BBCWeatherScraper()
    elif engine == "scrapy":
        return ScrapyWeatherScraper(
            storage_format=storage_format,
            output_filename=output_filename,
        )
    else:
        raise ValueError(
            f"Unknown scraper engine: {engine}. Supported engines: 'bs4', 'scrapy'"
        )

With this design, switching between engines becomes easy – you just need to change a command-line argument. More importantly, it lets you benchmark both approaches under identical conditions, using the same parser, same data models, and same storage logic. The only variable is the scraping engine itself.
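On the CLI side, the flag can be wired up with argparse. A minimal sketch – the project’s actual CLI likely defines more options, and the flag names here follow the article:

```python
import argparse

parser = argparse.ArgumentParser(description="BBC Weather scraper")
parser.add_argument(
    "--engine",
    choices=["bs4", "scrapy"],
    default="bs4",
    help="Scraping engine to use",
)
parser.add_argument("--format", dest="storage_format", default="json")

# e.g. `python main.py --engine scrapy` would parse to:
args = parser.parse_args(["--engine", "scrapy"])
print(args.engine)  # → scrapy
# The parsed values then feed the factory:
# scraper = create_scraper(engine=args.engine, storage_format=args.storage_format)
```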

BeautifulSoup Implementation

The BeautifulSoup implementation is straightforward – it directly integrates with Playwright through a BrowserService that manages browser lifecycle and page navigation. When you call scrape(), it opens a new browser tab, loads the URL, retrieves the HTML content, and passes it to the shared parser. The entire process is wrapped with retry logic and rate limiting to handle temporary failures and respect the target site.

Here’s the core scraping logic:

@retry_on_browser_error(max_attempts=3)
async def scrape(self, location: Location) -> WeatherData:
    try:
        await rate_limiter.acquire()

        url = settings.get_weather_url(location.location_id)
        logger.info(f"Scraping weather for {location.name} (ID: {location.location_id})")

        page = await self.browser_service.new_page()

        try:
            await self.browser_service.goto(page, url)
            html_content = await self.browser_service.get_content(page)

            # Use shared parser
            weather_data = self.parser.parse_html(html_content, location.name)
            weather_data.location_id = location.location_id
            weather_data.location_name = location.name

            logger.info(f"Successfully scraped weather for {location.name}")
            return weather_data

        finally:
            await self.browser_service.close_page(page)

    except Exception as e:
        raise ScraperException(f"Scraping failed for {location.name}: {str(e)}") from e

This approach gives you full control over the browser session – you can take screenshots, interact with elements, or handle custom authentication if needed. The tradeoff is that you’re managing the entire workflow yourself: opening pages, handling errors, and cleaning up resources. For single-location scrapes or prototyping, this level of control is an advantage, not a burden.
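The retry wrapper used above can be sketched as a small decorator. This illustrative version retries on any exception with a short fixed delay – the real retry_on_browser_error presumably targets browser-specific errors and uses smarter backoff:

```python
import asyncio
import functools

def retry_on_error(max_attempts=3, delay=0.01):
    """Retry an async function on failure (simplified illustration)."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    if attempt < max_attempts:
                        await asyncio.sleep(delay)  # wait before retrying
            raise last_exc
        return wrapper
    return decorator

# Demo: a fetch that fails twice, then succeeds on the third attempt
calls = {"n": 0}

@retry_on_error(max_attempts=3)
async def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>...</html>"

result = asyncio.run(flaky_fetch())
print(result)  # → <html>...</html>
```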

Scrapy Implementation

The Scrapy implementation takes a different approach. Instead of directly managing the browser, you define a Spider that declares which URLs to scrape and how to parse the responses. Scrapy handles the rest, including request scheduling, browser automation via scrapy-playwright, and pipeline processing. The spider itself is remarkably concise:

class BBCWeatherSpider(scrapy.Spider):
    name = "bbc_weather"

    def __init__(self, location: Optional[Location] = None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.location = location
        self.parser = BBCWeatherParser()

    def start_requests(self):
        url = settings.get_weather_url(self.location.location_id)

        yield scrapy.Request(
            url=url,
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_timeout", settings.page_load_wait),
                ],
            },
        )

    def parse(self, response):
        html_content = response.text

        # Use shared parser
        weather_data = self.parser.parse_html(html_content, self.location.name)
        weather_data.location_id = self.location.location_id

        yield weather_data

The spider yields data items, and Scrapy’s pipeline system automatically handles storage, error logging, and other post-processing. However, Scrapy runs in its own event loop, so integrating it into an async application requires wrapping it with a ThreadPoolExecutor. This adds complexity, but the payoff is clear: Scrapy excels at concurrent requests. If you need to scrape multiple locations simultaneously, Scrapy’s architecture handles parallelization out of the box, while BeautifulSoup would require manual async orchestration.
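The wrapping pattern looks roughly like this: the blocking Scrapy crawl is pushed onto a worker thread so the surrounding asyncio loop stays responsive. A placeholder blocking function stands in for the real crawl here, since starting an actual CrawlerProcess requires the full project setup:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def run_crawl_blocking(location_id: str) -> dict:
    """Placeholder for the blocking Scrapy crawl (e.g. CrawlerProcess.start())."""
    time.sleep(0.05)  # simulate the crawl occupying its thread
    return {"location_id": location_id, "forecasts": 332}

async def scrape_with_scrapy(location_id: str) -> dict:
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Offload the blocking crawl so the event loop is not frozen
        return await loop.run_in_executor(pool, run_crawl_blocking, location_id)

result = asyncio.run(scrape_with_scrapy("2643743"))
print(result["forecasts"])  # → 332
```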

Performance Benchmark

Since both engines use identical parsing logic, the performance comparison becomes a pure test of the scraping frameworks themselves. I ran benchmarks on a single location (London) to measure end-to-end execution time, from initializing the scraper to saving the final data. Both engines fetched the same 332 hourly forecasts under identical network conditions.

Here are the results:

Engine          Time (seconds)   Use Case
BeautifulSoup   ~7.2s            Single location, quick scrapes
Scrapy          ~6.0s            Multiple locations, production

Scrapy is about 17% faster, but this difference is less significant than it appears. The real bottleneck is the browser, not the framework. Both engines spend most of their time waiting for Playwright to load the page, run JavaScript, and render the content. The actual scraping and parsing logic accounts for a fraction of a second. When the browser dominates execution time, framework overhead becomes negligible.

The real performance difference emerges when scraping multiple locations. BeautifulSoup processes requests sequentially by default – scrape London, wait, scrape Manchester, wait, and so on. Scrapy, however, is built for concurrency. With proper configuration, it can scrape 5 locations in roughly the same time BeautifulSoup takes for one. Scrapy handles request scheduling, browser pooling, and parallel execution automatically.
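The effect is easy to demonstrate with simulated fetches: sequential awaits add up, while asyncio.gather – the kind of orchestration Scrapy gives you out of the box – overlaps the waits. Short placeholder delays stand in for real page loads:

```python
import asyncio
import time

async def fetch(city: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a multi-second browser page load
    return f"{city}: ok"

async def sequential(cities):
    return [await fetch(c) for c in cities]  # one at a time

async def concurrent(cities):
    return await asyncio.gather(*(fetch(c) for c in cities))  # all at once

cities = ["London", "Manchester", "Leeds", "Bristol", "Glasgow"]

t0 = time.perf_counter()
asyncio.run(sequential(cities))
seq_time = time.perf_counter() - t0

t0 = time.perf_counter()
results = asyncio.run(concurrent(cities))
conc_time = time.perf_counter() - t0

print(f"sequential ~{seq_time:.2f}s, concurrent ~{conc_time:.2f}s")
# Concurrent finishes in roughly the time of one fetch, not five
```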

For this weather scraper project, which targets single-location queries, the 17% speed advantage doesn’t justify Scrapy’s added complexity. BeautifulSoup is simpler to set up, easier to debug, and perfectly adequate for the task. But if the requirements changed – say, scraping weather data for 50 cities every hour – Scrapy would clearly be a better option.

When to Use What

The choice between BeautifulSoup and Scrapy isn’t about what is better – it’s about which tool fits your specific needs. Here’s how to decide:

Choose BeautifulSoup when:

• You’re scraping a single page or a handful of URLs

• You need fine-grained control over browser interactions (screenshots, authentication, complex navigation)

• You’re prototyping or building a one-off scraper

• You want minimal dependencies and simple debugging

• Your scraper is part of a larger async application where you control the event loop

Using BeautifulSoup with Playwright gives you a straightforward mental model: open the page, get the HTML, parse it, close the page. There’s no framework to learn, no settings to configure, and no hidden magic. You write procedural code that does exactly what it says.

Choose Scrapy when:

• You’re scraping dozens or hundreds of pages concurrently

• You need built-in pipelines for data processing, validation, and storage

• You’re building a production scraper that runs on a schedule

• You want automatic retry logic, rate limiting, and request scheduling

• You’re comfortable with the Spider/Pipeline architecture and declarative request handling

Scrapy shines when scraping becomes a system, not a script. If you’re building something that can scale, handle failures gracefully, and run reliably over time, Scrapy’s architecture pays off. The initial complexity becomes an investment, not a burden.

For this weather scraper, BeautifulSoup was the right choice. The project scrapes one location at a time, runs on demand rather than continuously, and benefits from the simplicity of direct browser control. Adding Scrapy provided a useful learning exercise and a fair performance comparison, but it didn’t solve a problem that BeautifulSoup couldn’t handle.

Conclusion

This project demonstrates that for most scraping tasks, BeautifulSoup is more than enough. The 17% performance difference with Scrapy disappears when the browser is the real bottleneck, not the framework. When you’re scraping a single page or a small number of URLs, Scrapy’s added complexity doesn’t provide meaningful benefits.

But this comparison also reveals where Scrapy excels: concurrent scraping at scale. If your requirements involve dozens of simultaneous requests, scheduled scraping runs, or complex data pipelines, Scrapy’s architecture is a good fit. In this case, the framework isn’t overkill but rather the right tool for the job.

More importantly, this project highlights the value of good architecture. By separating the scraping engine from the parsing logic, both implementations share the same BBCWeatherParser, data models, and validation rules. This approach eliminated code duplication, ensured fair benchmarking, and made switching between engines trivial. The lesson isn’t “BeautifulSoup vs Scrapy” – it’s that clean separation of concerns matters more than framework choice.

If you’re building a scraper and wondering which tool to use, start with BeautifulSoup. It’s simpler, faster to develop, and handles most use cases. Only reach for Scrapy when you’ve outgrown sequential scraping and need real concurrency. And regardless of which you choose, design your code so you can swap engines without having to rewrite everything.

Check out our blog for more technical insights.

If you seek top Python developers for hire, SysGears can help.