A python script that crawls conferences and process them using a local LLM to translate info to a structured format
You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Antonio De Lucreziis 8c6364277e initial commit, reworked @BachoSeven's code 9 months ago
.gitignore initial commit, reworked @BachoSeven's code 9 months ago
README.md initial commit, reworked @BachoSeven's code 9 months ago
main.py initial commit, reworked @BachoSeven's code 9 months ago
requirements.txt initial commit, reworked @BachoSeven's code 9 months ago

README.md

Past Conferences Crawler (with LLM)

A Python script that crawls conferences from https://www.dm.unipi.it/research/past-conferences/ and processes them using a local-run LLM (we used Mistral-7B-Instruct-v0.2) to translate the natural language info to a more structured format (json).

Installation

Download the LLM model to use, specifically we use Mistral-7B-Instruct-v0.2-GGUF:

# download the model, ~4GB
$ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

Install python the requirements:

# if you want create a venv
$ python -m venv venv
$ source venv/bin/activate

# enable gpu support for llama-cpp
$ export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

# install requirements
$ pip install -r requirements.txt

Launch

The following command will crawl the conferences from https://www.dm.unipi.it/research/past-conferences/ (pages 1 to 5) and save the results in conferences.json:

$ python main.py

The output is a list of json objects, one per line. To display the results with jq:

$ jq -s '.' conferences.json