A Comparison of 6 Open-Source AI Web Crawling Frameworks: ScrapeGraphAI, Skyvern, Firecrawl, Crawl4AI, Reader, and Markdowner

This is the third article in this series. It focuses on how to scrape data from the web to enrich the context of large models. Whether for a personal AI search engine or an enterprise knowledge base, obtaining real-time web data is a crucial capability, because up-to-date web content improves the accuracy and timeliness of large model responses. Methods for processing local documents (especially PDF files, scanned copies, images, etc.) were discussed in detail in the previous article.

Before diving into the discussion, it’s important to clarify the concept of AI crawlers (also known as LLM crawlers). They can be broadly divided into two categories. The first category includes conventional crawling tools where the results are directly used as context for LLMs. Strictly speaking, these are not directly related to AI. The second category consists of new LLM-driven crawling solutions, where users specify data collection targets in natural language. Subsequently, the LLM autonomously analyzes the web page structure, develops crawling strategies, executes interactive operations to acquire dynamic data, and ultimately returns structured target content.

New LLM-Driven Crawlers

Figure: LLM-Driven Crawling Solutions

For a detailed understanding of the ideas and practical methods behind general AI-driven web crawlers, refer to this article. Its author explains the approach comprehensively, from concept through solution, optimization, and result analysis. Here, I only give a quick overview. The process closely mirrors the steps a human would take:

  1. First, the entire HTML code of the web page is crawled.

  2. Then, AI is used to generate a series of related terms. For example, when looking for prices, AI might generate keywords such as “prices,” “fee,” “cost,” etc.

  3. Based on these keywords, the HTML structure is searched to locate relevant node lists.

  4. AI is used to analyze the node lists to determine the most relevant nodes.

  5. AI is applied to determine if interaction with the node is necessary (usually a click operation).

  6. The above steps are repeated until the final result is obtained.

Figure: AI Web Crawling Process
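Steps 1 to 3 above can be sketched with the Python standard library alone. This is a minimal illustration, not the method from the referenced article: `related_terms` is a hypothetical stand-in for the LLM call in step 2, and the ranking and interaction decisions in steps 4 to 6 are left out because they require a real model.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def related_terms(target):
    """Hypothetical stand-in for step 2: a real pipeline would ask an LLM."""
    canned = {"price": ["price", "fee", "cost"]}
    return canned.get(target, [target])


class TextNodeCollector(HTMLParser):
    """Collects (enclosing tag, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching tag to tolerate unclosed tags.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.nodes.append((self._stack[-1], text))


def find_candidate_nodes(html, target):
    """Step 3: keep the nodes whose text mentions any related term."""
    terms = related_terms(target)
    collector = TextNodeCollector()
    collector.feed(html)
    return [
        (tag, text)
        for tag, text in collector.nodes
        if any(term in text.lower() for term in terms)
    ]
```

In a full crawler, the candidate list returned here would be handed to the model, which picks the most relevant node and decides whether a click is needed before the loop repeats.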

Skyvern

Skyvern is a browser automation tool built on multimodal models, designed to make workflows more efficient and adaptable. Traditional automation typically relies on site-specific scripts, DOM parsing, and XPath selectors, all of which tend to break when a website's layout changes. Skyvern instead analyzes the visual elements in the browser window in real time and combines that analysis with LLM-generated interaction plans, so it can operate on websites it has never seen without custom code and is more resilient to layout changes. By integrating browser automation libraries like Playwright, it automates browser-based workflows using several key agents:

  • Interactable Element Agent: Responsible for parsing the HTML structure of a web page and extracting interactable elements.
  • Navigation Agent: Responsible for planning the navigation paths required to complete tasks, such as clicking buttons and entering text.
  • Data Extraction Agent: Responsible for extracting data from web pages, capable of reading tables and text, and outputting data in user-defined structured formats.
  • Password Agent: Responsible for filling out website password forms, able to read usernames and passwords from password managers while protecting user privacy.
  • 2FA Agent: Responsible for filling out two-factor authentication (2FA) forms, capable of intercepting website 2FA requests and obtaining 2FA codes through a user-defined API or waiting for user manual input.
  • Dynamic Auto-complete Agent: Responsible for filling out dynamically auto-completing forms, able to select appropriate options based on user input and form feedback, and adjust input content.
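To make the first agent's job concrete, here is a rough sketch of collecting interactable elements from raw HTML. It is not Skyvern's code: the class name, the tag list, and the output shape are all assumptions made for illustration, and it ignores the visual analysis that Skyvern layers on top.

```python
from html.parser import HTMLParser

# Assumed set of tags a user can click or type into.
INTERACTABLE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractableElementCollector(HTMLParser):
    """Records every interactable tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTABLE_TAGS:
            self.elements.append({"tag": tag, "attrs": dict(attrs)})


def extract_interactable(html):
    """Return the interactable elements of a page in document order."""
    collector = InteractableElementCollector()
    collector.feed(html)
    return collector.elements
```

A navigation step would then choose one of these elements and an action (click, type) to perform on it.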

ScrapeGraphAI

ScrapeGraphAI automates the construction of scraping pipelines through large language models and graph logic, reducing the need for manual coding. Users only need to specify the required information, and ScrapeGraphAI can automatically handle single-page or multi-page scraping tasks, efficiently extracting web page data. Supporting various document formats such as XML, HTML, JSON, and Markdown, ScrapeGraphAI provides several types of scraping methods, including:

  • SmartScraperGraph: Achieves single-page scraping with just a user prompt and input source.
  • SearchGraph: A multi-page scraper that extracts information from top search results.
  • SpeechGraph: A single-page scraper that converts website content into audio files.
  • ScriptCreatorGraph: A single-page scraper that generates a Python script for extracting the requested data.
  • SmartScraperMultiGraph: Achieves multi-page scraping through a single prompt and a series of sources.
  • ScriptCreatorMultiGraph: A multi-page scraper that extracts information from multiple pages and sources, and generates the corresponding Python scripts.

ScrapeGraphAI simplifies the web scraping process. Users do not need in-depth programming knowledge; they only need to describe the information they want in order to automate a scraping task. It supports scraping from single pages to multiple pages and is suitable for data extraction tasks of varying scales. It also provides different types of scraping pipelines to meet various needs, including information extraction, audio generation, and script creation.
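A single-page scrape with SmartScraperGraph might look like the sketch below. It follows the usage pattern in the project's README as I understand it (a prompt, a source, and a config dict); the model string, the config values, and the helper names `build_graph_config` and `scrape` are assumptions for illustration, so check them against the version you install.

```python
def build_graph_config(api_key, model="openai/gpt-4o-mini"):
    """Build the config dict passed to a ScrapeGraphAI graph.

    The model name is an assumed example; use whatever your provider offers.
    """
    return {
        "llm": {"api_key": api_key, "model": model},
        "verbose": False,
        "headless": True,
    }


def scrape(prompt, source, api_key):
    """Run one SmartScraperGraph job and return its result."""
    # Imported lazily so this module loads even without scrapegraphai installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt=prompt,
        source=source,
        config=build_graph_config(api_key),
    )
    return graph.run()
```

Calling `scrape("List the article titles", "https://example.com", api_key)` would need the package, a browser backend, and a valid key, so it is not executed here.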

Conventional Web Crawling Tools

These tools clean regular web content and convert it into Markdown, which large models can understand and process more easily. (Large models tend to respond better when the input is structured, for example as Markdown.) The converted content is then used as context for the LLM, so the model can answer questions using online resources.

Crawl4AI

Crawl4AI is an open-source web crawling and data extraction framework designed specifically for AI applications. It allows for simultaneous crawling of multiple URLs, greatly reducing the time required for large-scale data collection. Key features of Crawl4AI that stand out in the field of web crawling include:

  1. Multiple Output Formats: Supports multiple output formats such as JSON, minimal HTML, and Markdown.
  2. Dynamic Content Support: Through custom JavaScript code, Crawl4AI can simulate user behavior, such as clicking the “next page” button, to load more dynamic content. This approach enables Crawl4AI to handle common dynamic content loading mechanisms like pagination and infinite scrolling.
  3. Multiple Chunking Strategies: Supports multiple chunking strategies such as topic, regular expressions, and sentences, allowing users to customize data according to specific needs.
  4. Media Extraction: Employs powerful methods like XPath and regular expressions, enabling users to precisely locate and extract desired data. It can extract various media types, including images, audio, and video, which is particularly useful for applications relying on multimedia content.
  5. Customizable Hooks: Users can define custom hooks, such as the on_execution_started hook that is executed at the beginning of a crawl. This can be used to ensure that all necessary JavaScript is executed before starting the crawl and that dynamic content is loaded on the page.
  6. High Stability: Crawling dynamic content can fail due to network issues or JavaScript execution errors. Crawl4AI’s error handling and retry mechanisms mean that it retries when these problems occur, which helps preserve the integrity and accuracy of the data.
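To illustrate what the regular-expression and sentence chunking strategies in item 3 do, here is a small standard-library sketch. It does not use Crawl4AI's own classes; the function names and the default patterns are assumptions chosen for the example.

```python
import re


def chunk_by_regex(text, pattern=r"\n\s*\n"):
    """Split on a regex (default: blank lines) and drop empty chunks."""
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]


def chunk_by_sentence(text):
    """Naive sentence split: break at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

Regex chunking suits content with clear separators such as paragraphs or headings, while sentence chunking gives finer pieces for embedding. The naive sentence rule here will mis-split abbreviations such as "e.g. this".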

Reader

Reader is a web content scraping tool developed by Jina AI. Users only need to enter a URL, and Reader cleans and formats the page content, outputting it as plain text or Markdown. The goal is to convert any web page into an input format that large models can understand, i.e., to convert rich content into plain text, for example by converting images into descriptive text.

Figure: Reader Functionality
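In practice, "entering a URL" means prefixing the target address with the Reader endpoint. The sketch below assumes the hosted prefix `https://r.jina.ai/`, which is not stated in this article; the helper names are mine.

```python
import urllib.request

# Assumed hosted Reader endpoint; a self-hosted instance would differ.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url):
    """Build the Reader URL for a page by prefixing the target address."""
    return READER_PREFIX + target_url


def fetch_as_markdown(target_url, timeout=30.0):
    """Fetch the cleaned page text through Reader (performs a network call)."""
    with urllib.request.urlopen(reader_url(target_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

The returned text can be placed directly into an LLM prompt as context.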

Firecrawl

Firecrawl is designed to be more elegant and powerful than Reader, and it feels like a mature product. It provides a simplified API for crawling entire websites and extracting data from them. Firecrawl can convert website content into Markdown, structured data, screenshots, simplified HTML, hyperlinks, and metadata to better support the use of LLMs. It also takes care of complex tasks such as proxy settings, anti-crawling mechanisms, dynamic content (such as JavaScript rendering), output parsing, and task coordination. Developers can customize the behavior of the crawler, for example by excluding specific tags, crawling pages that require authentication, and setting the maximum crawl depth. Firecrawl supports data parsing for various media types, including PDF, DOCX documents, and images, and users can interact with web pages through simulated clicks, scrolls, and inputs. The latest version also supports batch processing of large numbers of URLs.
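As a rough idea of what a call to the simplified API looks like, the sketch below builds (but does not send) a single-page scrape request. The endpoint path, the `url` and `formats` fields, and the bearer-token header reflect the hosted v1 API as I understand it and should be treated as assumptions; verify them against the current Firecrawl documentation.

```python
import json
import urllib.request

# Assumed hosted endpoint; self-hosted deployments use their own base URL.
FIRECRAWL_SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(url, api_key, formats=("markdown",)):
    """Prepare a POST request asking Firecrawl to scrape one page."""
    body = json.dumps({"url": url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        FIRECRAWL_SCRAPE_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would perform the call and return a JSON response.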

Markdowner

If the paid quota of the previous two tools is insufficient, or if self-deployment consumes too many resources, consider using Markdowner. Markdowner converts website content to Markdown. Although its features are not as diverse as Firecrawl's, it is sufficient for daily needs. The tool supports automatic crawling, LLM filtering, a detailed Markdown mode, and text and JSON response formats. Markdowner provides an API that users access through GET requests, customizing the response type and content through URL parameters. Technically, Markdowner uses Cloudflare Workers and the Turndown library for web page content conversion.
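The GET interface can be sketched as a URL builder. The base URL `https://md.dhr.wtf/` and the parameter names `enableDetailedResponse`, `crawlSubpages`, and `llmFilter` are taken from the project's README as I recall it, so treat them as assumptions; the function name is mine.

```python
from urllib.parse import urlencode

# Assumed public instance; a self-deployed Worker has its own base URL.
MARKDOWNER_BASE = "https://md.dhr.wtf/"


def markdowner_url(target_url, detailed=False, crawl_subpages=False,
                   llm_filter=False, base=MARKDOWNER_BASE):
    """Build the GET URL that asks Markdowner to convert a page."""
    params = {"url": target_url}
    if detailed:
        params["enableDetailedResponse"] = "true"
    if crawl_subpages:
        params["crawlSubpages"] = "true"
    if llm_filter:
        params["llmFilter"] = "true"
    return base + "?" + urlencode(params)
```

Only the flags that are switched on are added to the query string, which keeps the default request as simple as a single `url` parameter.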

Others

Similar crawling tools include webscraper, code-html-to-markdown (particularly adept at handling code blocks), MarkdownDown, gpt-api, and web.scraper.workers.dev (a tool that I use constantly; it supports content filtering and, with slight modifications, can access paid content). Once self-deployed, these tools can be used as plugins that give large models access to online content, making them essential in the data preprocessing stage.

https://liduos.com/en/ai-develope-tools-series-3-open-source-ai-web-crawler-frameworks.html

Author: 莫尔索

Posted on: 2024-12-10

Updated on: 2025-01-05
