You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Francesco Minnocci 2a29c9e09f | 10 months ago | |
---|---|---|
.gitignore | 10 months ago | |
README.md | 10 months ago | |
main.py | 10 months ago | |
requirements.txt | 10 months ago |
README.md
Past Conferences Crawler (with LLM)
A Python script that crawls conferences from https://www.dm.unipi.it/research/past-conferences/ and processes them using a local-run LLM (we used Mistral-7B-Instruct-v0.2) to translate the natural language info to a more structured format (json).
Installation
Download the LLM model to use, specifically we use Mistral-7B-Instruct-v0.2-GGUF
:
# download the model, ~4GB
$ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
Install python the requirements:
# if you want create a venv
$ python -m venv venv
$ source venv/bin/activate
# enable gpu support for llama-cpp
$ export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
# install requirements
$ pip install -r requirements.txt
Launch
The following command will crawl all the conferences from https://www.dm.unipi.it/research/past-conferences/
(all pages) and save the results in conferences.json
as a list of json objects, one per line.
$ python main.py
The output is a list of json objects, one per line. To display the results with jq
:
$ jq -s '.' conferences.json