# Generate paper JSON files from a collection XML file, with full-text extraction
This is a slightly rearranged version of Sotaro Takeshita's code, which is available at https://github.com/gengo-proj/data-factory.
## Requirements
- Docker
- Python>=3.10
- Python packages:
  - acl-anthology-py>=0.4.3
  - bs4
  - jsonschema
## Setup
Start the GROBID Docker container:
```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```
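Before running the crawler, it can help to confirm that the container is actually accepting requests. The following is a minimal sketch, not part of the original code: it assumes the server is reachable at `http://localhost:8070` (the port mapped above) and that GROBID's `/api/isalive` health endpoint answers with the text `true`. The function names `grobid_is_alive` and `wait_for_grobid` are hypothetical helpers introduced here for illustration.

```python
import time
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0,
                    opener=urllib.request.urlopen):
    """Return True if the GROBID health endpoint responds with 'true'.

    Assumption: the service exposes GET /api/isalive on the mapped port.
    `opener` is injectable so the check can be exercised without a server.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8").strip().lower()
            return resp.status == 200 and body == "true"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, etc.
        return False


def wait_for_grobid(base_url="http://localhost:8070", retries=30, delay=2.0,
                    check=grobid_is_alive, sleep=time.sleep):
    """Poll the health check up to `retries` times, pausing `delay` seconds."""
    for _ in range(retries):
        if check(base_url):
            return True
        sleep(delay)
    return False
```

Calling `wait_for_grobid()` right after starting the container gives the model loading a chance to finish; a `False` result suggests checking the container logs and the port mapping.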
Get the metadata from the ACL Anthology:
```bash
git clone git@github.com:acl-org/acl-anthology.git
```
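A quick sanity check of the clone can catch a wrong path before a long crawl. The sketch below assumes the collection XML files live in an `xml/` subdirectory of the repository's `data/` directory, as in the ACL Anthology repository layout; `list_collection_files` is a hypothetical helper, not something the crawler provides.

```python
from pathlib import Path


def list_collection_files(data_dir):
    """Return the sorted collection XML files found under <data_dir>/xml.

    Assumption: `data_dir` is the `data/` directory of the cloned
    acl-anthology repository (the value later passed to --anthology-data-dir).
    """
    xml_dir = Path(data_dir) / "xml"
    if not xml_dir.is_dir():
        raise FileNotFoundError(f"No xml/ directory found under {data_dir}")
    return sorted(xml_dir.glob("*.xml"))
```

For example, `len(list_collection_files("./acl-anthology/data/"))` should be greater than zero once the clone has finished.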
## Usage
```bash
python src/data/acl_anthology_crawler.py \
    --base-output-dir <path/to/save/raw-paper.json> \
    --pdf-output-dir <path/to/save/downloaded/paper.pdf> \
    --anthology-data-dir ./acl-anthology/data/
```