Python scraper based on AI
Updated Jul 3, 2025 - Python
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
The Ultimate Information Gathering Toolkit
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
Scalable Python web scraping scripts for 40+ popular domains
Open-source Korean chatbot framework
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files. Presented at WWW 2025 @ Sydney, Australia (https://dl.acm.org/doi/10.1145/3701716.3715289)
A simple, easy-to-use command-line web crawler.
Undetected web-scraping & seamless HTML parsing in Python!
Data Analysis & Mining for lagou.com
Aims to bring together music platforms such as NetEase Cloud Music, KuGou, QQ Music, and Kuwo in one place
A simple distributed crawler for Zhihu, with data analysis
Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)
A Python script that empowers people with no programming background to generate robust leads at scale. This repo will collect various versatile techniques used in lead generation.
An easy way to brute-force web directories.
Scrape data from Goodreads using Scrapy and Selenium 📚