A Python script that crawls conferences and processes them using a local LLM to translate the information into a structured format
Antonio De Lucreziis ce3ba63080 fix: wrong syntax 10 months ago

Past Conferences Crawler (with LLM)

A Python script that crawls conferences from https://www.dm.unipi.it/research/past-conferences/ and processes them with a locally run LLM (we used Mistral-7B-Instruct-v0.2) to translate the natural-language information into a more structured format (JSON).

Installation

Download the LLM to use. We use Mistral-7B-Instruct-v0.2-GGUF:

# download the model, ~4GB
$ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
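
A failed or redirected download of a multi-gigabyte model is easy to miss, for example when an HTML page is saved in place of the binary. The sketch below is a quick sanity check, not part of main.py: it relies only on the fact that GGUF files begin with the four magic bytes `GGUF`. The helper name `looks_like_gguf` is hypothetical.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    # File name as produced by the wget command; adjust if saved elsewhere.
    model = Path("mistral-7b-instruct-v0.2.Q4_K_M.gguf")
    if not model.exists():
        print(f"{model} not found")
    elif looks_like_gguf(model):
        size_gb = model.stat().st_size / 1e9
        print(f"{model} looks like a GGUF model ({size_gb:.1f} GB)")
    else:
        print(f"{model} is not a GGUF file; check the download URL")
```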

Install the Python requirements:

# optionally, create a venv
$ python -m venv venv
$ source venv/bin/activate

# enable OpenBLAS acceleration for llama-cpp (a CPU BLAS backend; GPU offloading needs a different build flag)
$ export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

# install requirements
$ pip install -r requirements.txt

Launch

The following command crawls all the conferences from https://www.dm.unipi.it/research/past-conferences/ (all pages) and saves the results in conferences.json as JSON objects, one per line.

$ python main.py

Because the file holds one JSON object per line rather than a single array, use jq's -s (slurp) option to display the results as one list:

$ jq -s '.' conferences.json
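
The same one-object-per-line file can be read from Python without jq. This is a minimal sketch, assuming the file is UTF-8 and that each record carries the `title` and `startDate` keys shown in the example output further down. The function name `load_conferences` is hypothetical.

```python
import json
import os


def load_conferences(path="conferences.json"):
    """Read one JSON object per line, skipping blank lines."""
    conferences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                conferences.append(json.loads(line))
    return conferences


if __name__ == "__main__":
    if os.path.exists("conferences.json"):
        for conf in load_conferences():
            # .get() tolerates records where the model omitted a field
            print(conf.get("startDate"), conf.get("title"))
```

Parsing line by line means a single malformed record raises an error at a known line instead of invalidating the whole file.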

Main idea / Explanation

We need to parse strings like the following:

<p><a href="http://www.crm.sns.it/event/507/" target="_blank" rel="noreferrer
noopener">Statistical and Computational Aspects of Dynamics<br></a>Organized by Buddhima
Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi &#8211; SNS),
Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De
Giorgi &#8211; SNS, Pisa. December 13 &#8211; 16, 2022.</p>

<p><a href="https://events.dm.unipi.it/event/126/" target="_blank" rel="noreferrer
noopener">Weekend di lavoro su Calcolo delle Variazioni<br></a>Organized by Giuseppe
Buttazzo, Maria Stella Gelli, and Aldo Pratelli. Grand Hotel Tettuccio Montecatini.
November 25 &#8211; 27, 2022.</p>

<p><a href="https://events.dm.unipi.it/event/109/" target="_blank" rel="noreferrer
noopener">Incontri di geometria algebrica ed aritmetica Milano – Pisa<br></a>Department
of Mathematics, Pisa. November 16 &#8211; 17, 2022.</p>
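
To see which parts of these strings are mechanical and which are not, here is a sketch using only the standard library. The link URL and title can be read straight from the `<a>` tag. The trailing sentence mixes organizers, venue, and dates in free form, and that is the part that motivates using an LLM. This is an illustration, not what main.py does; `AnchorSplitter` and `split_conference` are hypothetical names.

```python
from html.parser import HTMLParser


class AnchorSplitter(HTMLParser):
    """Split a conference <p> into link URL, link text, and trailing text."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8211; (en dash)
        super().__init__(convert_charrefs=True)
        self.url = None
        self.title_parts = []
        self.rest_parts = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self.url = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        target = self.title_parts if self._in_anchor else self.rest_parts
        target.append(data)


def split_conference(html_snippet):
    """Return (url, title, free_text) with whitespace collapsed."""
    parser = AnchorSplitter()
    parser.feed(html_snippet)
    parser.close()

    def normalize(parts):
        return " ".join("".join(parts).split())

    return parser.url, normalize(parser.title_parts), normalize(parser.rest_parts)
```

The third element of the returned tuple is still a single unstructured sentence, and entries such as the last example above have no "Organized by" clause at all, so fixed rules for splitting it would be brittle.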

We compose the following prompt for the LLM. Anything between {{ ... }} is a placeholder that is replaced before generation starts.

INPUT:
<p><a href="http://www.crm.sns.it/event/507/" target="_blank" rel="noreferrer
noopener">Statistical and Computational Aspects of Dynamics<br></a>Organized by Buddhima
Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi &#8211; SNS),
Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De
Giorgi &#8211; SNS, Pisa. December 13 &#8211; 16, 2022.</p>

OUTPUT JSON:
{ 
    "title": "Statistical and Computational Aspects of Dynamics",
    "url": "http://www.crm.sns.it/event/507/",
    "description": "Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca
        matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle
        Schindler (Universität Wien, Austria). Location: Centro De Giorgi - SNS, Pisa.",
    "startDate": "2022-12-13",
    "endDate": "2022-12-16"
}

INPUT:
{{ conference_html }}

OUTPUT JSON:

The LLM then completes the prompt with a JSON representation of {{ conference_html }}. The worked example at the top is needed to show the model how to map the input HTML to the output fields.
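
The prompt construction and the handling of the model's reply can be sketched as follows. This is a minimal sketch and not the code in main.py: the names `build_prompt`, `parse_completion`, and `extract` are hypothetical, the worked example is adapted from the one above (dashes in the description normalized to hyphens), and the model is abstracted as any callable that maps a prompt string to completion text.

```python
import json

# Worked example shown to the model, adapted from the README.
EXAMPLE_HTML = (
    '<p><a href="http://www.crm.sns.it/event/507/" target="_blank" '
    'rel="noreferrer noopener">Statistical and Computational Aspects of '
    "Dynamics<br></a>Organized by Buddhima Kasun Fernando Akurugodage (Centro "
    "di ricerca matematica Ennio De Giorgi &#8211; SNS), Paolo Giulietti, and "
    "Tanja Isabelle Schindler (Universität Wien, Austria). Centro De Giorgi "
    "&#8211; SNS, Pisa. December 13 &#8211; 16, 2022.</p>"
)
EXAMPLE_JSON = json.dumps(
    {
        "title": "Statistical and Computational Aspects of Dynamics",
        "url": "http://www.crm.sns.it/event/507/",
        "description": "Organized by Buddhima Kasun Fernando Akurugodage "
        "(Centro di ricerca matematica Ennio De Giorgi - SNS), Paolo "
        "Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). "
        "Location: Centro De Giorgi - SNS, Pisa.",
        "startDate": "2022-12-13",
        "endDate": "2022-12-16",
    },
    indent=4,
    ensure_ascii=False,
)

TEMPLATE = (
    "INPUT:\n" + EXAMPLE_HTML + "\n\nOUTPUT JSON:\n" + EXAMPLE_JSON
    + "\n\nINPUT:\n{{ conference_html }}\n\nOUTPUT JSON:\n"
)


def build_prompt(conference_html):
    """Fill the {{ conference_html }} placeholder of the few-shot template."""
    return TEMPLATE.replace("{{ conference_html }}", conference_html)


def parse_completion(text):
    """Decode the first JSON object in the completion, ignoring what follows."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in completion")
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj


def extract(conference_html, complete):
    """`complete` is any callable taking a prompt and returning text."""
    return parse_completion(complete(build_prompt(conference_html)))
```

With llama-cpp-python, `complete` could wrap a `Llama` instance, for example `lambda p: llm(p, max_tokens=512, stop=["INPUT:"])["choices"][0]["text"]`; the `max_tokens` and `stop` values here are guesses. Decoding only the first JSON object guards against the model continuing with another invented `INPUT:` block after its answer.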