# ruoyi-python-spider

**Repository Path**: supperbuoumi/ruoyi-python-spider

## Basic Information

- **Project Name**: ruoyi-python-spider
- **Description**: Ruoyi-Python-Crawler is built on the RuoYi-Boot enterprise rapid development framework and integrates Python crawling and data processing. It uses a hybrid architecture in which Java (Spring Boot) handles the business system and Python handles data collection and cleaning. It suits applications that need large-scale data scraping, automated cleaning, and enterprise back-office management, such as public opinion monitoring, e-commerce data analysis, and scientific research data collection.
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 13
- **Forks**: 0
- **Created**: 2025-06-12
- **Last Updated**: 2025-09-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Python, Java, Vue

## README

# RuoYi-Python-Crawler

Built on the [RuoYi-Boot](https://github.com/ruoyi-cloud/ruoyi-boot) enterprise-level rapid development framework, **RuoYi-Python-Crawler** is a hybrid architecture that integrates the Java (Spring Boot) and Python technology stacks. Spring Boot handles business system development, while Python handles data collection and cleaning, so that large-scale crawling and processing can run efficiently and stably.

---

## Project Introduction

**RuoYi-Python-Crawler** is an enterprise **back-office management system built on a Java + Python hybrid architecture**. It suits scenarios that require large-scale data collection, automated cleaning, and integration with an enterprise back-office management platform.

With this project, developers can:

- Use **Spring Boot** to quickly build a secure and stable business system.
- Use **Python web scraping** to collect and clean data efficiently.
- Separate data collection tasks from business logic to improve the maintainability and scalability of the system.
- Support multiple application scenarios, such as public opinion monitoring, e-commerce data analysis, and scientific research data collection.

---

## Architecture Features

### Spring Boot-led Business System

- Rapidly build enterprise-level back-office management systems based on RuoYi-Boot.
- Provide basic functions such as user permission management, menu management, log management, and scheduled tasks.
- Support a decoupled front-end/back-end architecture, making it easy to integrate with front-end frameworks like Vue.js and React.

### Python-driven Data Collection

- Write spider modules in Python, with support for mainstream libraries such as Scrapy, Requests, and Selenium.
- Flexibly configure parameters such as crawl strategies, proxy IPs, and request frequencies.
- Support preprocessing operations such as data cleaning, format conversion, and deduplication.

### Hybrid Architecture Design

- Java is responsible for core business logic and system management.
- Python is responsible for data collection and processing, and runs as a service module decoupled from the Java layer.
- The two layers communicate through a REST API or RPC, which keeps data exchange between the systems flexible.

### Multiple Deployment Options Supported

- Monolithic Deployment: Suitable for small-scale scenarios and quick launches.
- Distributed Deployment: Supports deploying Python crawler clusters on independent servers to improve data collection throughput.
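
The preprocessing steps listed under Python-driven Data Collection (cleaning, format conversion, deduplication) might look roughly like the sketch below. It uses only the Python standard library. The record fields (`title`, `price`, `url`) and the function names `clean_record`, `fingerprint`, and `preprocess` are assumptions made for illustration, not names taken from this project. Deduplication here hashes the cleaned record, so two raw rows that differ only in whitespace or price formatting collapse into one.

```python
import hashlib
import json
import re
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def clean_record(raw: dict) -> dict:
    """Normalize one scraped record (hypothetical fields: title, price, url)."""
    # Cleaning: collapse runs of whitespace and trim the ends.
    title = _WHITESPACE.sub(" ", str(raw.get("title", ""))).strip()
    # Format conversion: pull the first number out of a price string.
    match = _NUMBER.search(str(raw.get("price", "")).replace(",", ""))
    price = float(match.group(0)) if match else None
    url = str(raw.get("url", "")).strip()
    return {"title": title, "price": price, "url": url}


def fingerprint(record: dict) -> str:
    """Stable hash of a cleaned record, used as the deduplication key."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def preprocess(raw_records: Iterable[dict]) -> Iterator[dict]:
    """Clean each record and drop any whose cleaned form was already seen."""
    seen = set()
    for raw in raw_records:
        record = clean_record(raw)
        key = fingerprint(record)
        if key in seen:
            continue
        seen.add(key)
        yield record
```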

---

## Technology Stack

| Technology | Description |
| --- | --- |
| Spring Boot | Core framework for quickly building enterprise-level applications |
| MyBatis | ORM framework, simplifying database operations |
| Redis | Data caching and task queue support |
| Quartz | Scheduled task scheduler |
| Python 3.x | Language environment for data collection and cleaning |
| Scrapy / Requests / Selenium | Web crawler technology options |
| FastAPI | Python interface service for communication with Java |
| MySQL / MongoDB | Data storage support |

---

## Application Scenarios

- **Public Opinion Monitoring System**: Real-time collection of news, comments, and social media content from the Internet, with sentiment analysis and trend prediction.
- **E-commerce Data Analysis**: Automatic collection of product, price, sales volume, and other data from major e-commerce platforms, with visual reports.
- **Scientific Research Data Collection**: Structured data collection from academic websites and paper databases in specific fields.
- **Industry Monitoring System**: Continuous monitoring of content changes on target websites, triggering early warning mechanisms.

---

## Directory Structure Description

```
ruoyi-python-crawler/
├── RuoYi/
├───── ruoyi-admin/         # RuoYi Admin backend management module
├───── ruoyi-common/        # RuoYi common module
├───── ruoyi-framework/     # RuoYi framework module
├───── ruoyi-generator/     # RuoYi code generation module
├───── ruoyi-python/        # RuoYi Python extension module
├───── ruoyi-quartz/        # RuoYi scheduled task dispatching module
├───── ruoyi-system/        # RuoYi system module
├── crawler/
├───── app/                 # crawler app module
├───── collector/           # crawler data collection module
├───── utils/               # crawler utility module
├───── main.py              # crawler program entry point
├───── requirements.txt     # third-party libraries used by crawler
├── .gitignore              # Git ignore file configuration
├── LICENSE                 # open source license file
├── README.md               # project description document
└── pom.xml                 # Maven project configuration file
```

---

## Function List

### Back-end Management System

- User Management: Role assignment, permission allocation, and login authentication.
- Menu Management: Dynamic configuration of system menus and permission control.
- Log Management: Recording user operation logs and system operation logs.
- Parameter Management: Unified configuration of system parameters and crawler parameters.
- Scheduled Tasks: Support for periodic triggering of crawler tasks.

### Python Web Crawler System

- Crawler Task Management: Supports manual start, stop, and status check.
- Crawler Parameter Configuration: Supports custom request headers, proxies, timeout settings, etc.
- Data Cleaning Engine: Supports data processing methods such as regular expression extraction, XPath, and JSON parsing.
- Data Output Interface: Supports multiple output formats including JSON, CSV, and databases.

### Data Communication Interface

- FastAPI provides RESTful interfaces for Java to call.
- Supports asynchronous task execution and result callbacks.
- Offers a task status query interface for convenient progress display in the background.
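
The asynchronous execution, result callback, and status query described under Data Communication Interface could be backed by something like the in-memory registry sketched below. This is an illustration under stated assumptions, not the project's actual code: the class name `TaskRegistry`, its methods, and the state names are hypothetical, and only the Python standard library is used. In a FastAPI service, one route would presumably call `submit` to start a crawl and another would return `status(task_id)` to the Java side, while `on_done` is where a callback request to Java could be sent. Note that `on_done` runs on a worker thread, so it should stay short.

```python
import threading
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional


class TaskRegistry:
    """In-memory registry of crawl tasks (hypothetical, for illustration only)."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: Dict[str, Future] = {}
        self._lock = threading.Lock()

    def submit(
        self,
        job: Callable[..., Any],
        *args: Any,
        on_done: Optional[Callable[[str, dict], None]] = None,
    ) -> str:
        """Run job(*args) in the background and return a task id immediately."""
        task_id = uuid.uuid4().hex
        future = self._pool.submit(job, *args)
        with self._lock:
            self._futures[task_id] = future
        if on_done is not None:
            # Result callback: invoked once the job has finished or failed.
            future.add_done_callback(
                lambda _f: on_done(task_id, self.status(task_id))
            )
        return task_id

    def status(self, task_id: str) -> dict:
        """Return a JSON-serializable snapshot of one task's state."""
        with self._lock:
            future = self._futures.get(task_id)
        if future is None:
            return {"task_id": task_id, "state": "unknown"}
        # Read running() before done() so a task that completes in between
        # is reported as finished rather than pending.
        running = future.running()
        if future.done():
            if future.cancelled():
                return {"task_id": task_id, "state": "cancelled"}
            error = future.exception()
            if error is not None:
                return {"task_id": task_id, "state": "failed", "error": str(error)}
            return {"task_id": task_id, "state": "finished", "result": future.result()}
        return {"task_id": task_id, "state": "running" if running else "pending"}
```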

---

## Quick Start

### 1. Clone Project

```bash
git clone https://github.com/yourname/ruoyi-python-crawler.git
cd ruoyi-python-crawler
```

### 2. Start the Spring Boot Backend Service

```bash
# Adjust the path if your checkout keeps the modules under RuoYi/
cd ruoyi-admin
mvn spring-boot:run
```

To access the backend, visit: `http://localhost:8080`

Default account: admin / admin123

### 3. Start the Python Web Crawler Service

```bash
# Adjust the relative path to match where crawler/ sits in your checkout
cd ../crawler
pip install -r requirements.txt
python main.py
```

To access the Python interface, visit: `http://localhost:8000/docs` (Swagger UI)

---

## Contribution Guide

Contributions are welcome: submit code, issues, or pull requests to help improve this project. Please follow these guidelines:

- Fork the project and create your own branch.
- Write clear and meaningful commit messages.
- Include a brief description of the changes when submitting a PR.
- All code must comply with the project's standards and be verified through testing.

---

## License

This project is open sourced under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).

---

## Contact Us

If you have any questions or suggestions, please feel free to contact the author or submit an issue.

- Author's Email: 15801421798@163.com

---

**RuoYi-Python-Crawler** puts a Java and Python hybrid architecture into practice and offers a one-stop solution for data-driven enterprises. Try it out!