Spaces:
Runtime error
A newer version of the Streamlit SDK is available:
1.51.0
批踢踢爬蟲 ptt-crawler
學術研究用途,請勿不當使用。
This project scrapes the post details from the website PTT, and writes the scraped items to csv files.
| author | alias | title | date | ip | city | country | ups | downs | comments | url |
|---|---|---|---|---|---|---|---|---|---|---|
| jason789780 | majiLove | [請益] google問題的精確與方向 | 2022-09-06 10:39:42 | 223.137.68.113 | Yilan | Taiwan | 9 | 0 | 29 | https://www.ptt.cc/bbs/Soft_Job/M.1662431984.A.A3F.html |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
說明
1. Installation
Python version
python >= 3.9
Clone repository
git clone git@github.com:lopentu/nlp_web.gitInstall Requirement
cd nlp_web/assignments/ptt-crawler && pip install -r requirements.txt
2. 使用方式
- Commands
scrapy crawl ptt -a boards=BOARDS [-a scrap_all=BOOLEAN]
[-a index_from=NUMBER -a index_to=NUMBER]
[-a since=YEAR] [-a data_dir=PATH]
[-a ip_cache=BOOLEAN]
positional arguments:
-a boards=BOARDS ptt board name (e.g. Soft_Job)
-a index_from=NUMBER -a index_to=NUMBER html index number from a ptt board
-a scrap_all=BOOLEAN scrap all posts if true
-a since=YEAR scrap all posts from a given year
-a data_dir=PATH output file path (default: ./data)
-a ip_cache=BOOLEAN enable redis service to cache ip if true
If you enable
ip_cache, please make sure you have Redis on your local machine. Otherwise, you can usedocker-composeto run the crawler.
Crawl all the posts of a board:
scrapy crawl ptt -a boards=Soft_Job -a scrap_all=TrueCrawl all the posts of a board from a year in the past:
scrapy crawl ptt -a boards=Soft_Job -a since=2020Crawl the posts of a board based on html indexes:
scrapy crawl ptt -a boards=Soft_Job -a index_from=1722 -a index_to=1723Please make sure the number of
index_fromis greater thanindex_to.Crawl the posts of multiple boards. For example:
scrapy crawl ptt -a boards=Soft_Job,Gossiping -a index_from=1722 -a index_to=1723Note: the comma in the argument
boardscannot have spaces. It cannot beboards=Soft_Job, Baseballorboards=["Soft_Job", "Baseball"].
3. 使用 Docker
A Docker setup is provided for the crawler.
To run the crawler, go to the docker-compose.yml file to edit the command:
version: "3"
services:
scraptt:
build: .
environment:
- PYTHON_ENV=production
depends_on:
- redis
links:
- redis
# define your crawler here!
command: bash -c "scrapy crawl ptt -a boards=Soft_Job,Gossiping -a index_from=1722 -a index_to=1723 -a ip_cache=True"
Feel free to change the command as long as it follows the command format as stated above.
Now start the crawler:
docker-compose up
We suggest setting the argument ip_cache to true when you run the crawler via Docker. The reason is that the crawler will connect to Redis container, and there is no need for you to have Redis on your local machine. Most importantly, the performance of the crawler will be increased!
Contact
If you have any suggestion or question, please do not hesitate to email us at shukai@gmail.com or philcoke35@gmail.com