# Past Conferences Crawler (with LLM)

A Python script that crawls past conferences from `https://www.dm.unipi.it/research/past-conferences/` and processes them with a locally run LLM (we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)) to translate the natural-language information into a more structured format (JSON).

## Installation

Download the LLM model to use; specifically, we use [`Mistral-7B-Instruct-v0.2-GGUF`](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF):

```bash
# download the model, ~4GB
$ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```

Install the Python requirements:

```bash
# optionally, create a venv
$ python -m venv venv
$ source venv/bin/activate
# build llama-cpp with OpenBLAS acceleration
$ export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
# install requirements
$ pip install -r requirements.txt
```

## Launch

The following command crawls all the conferences from `https://www.dm.unipi.it/research/past-conferences/` (all pages) and saves the results in `conferences.json` as JSON objects, one per line.

```bash
$ python main.py
```

To display the results with `jq`:

```bash
$ jq -s '.' conferences.json
```

## Main idea / Explanation

We need to parse strings like the following:

```html

Statistical and Computational Aspects of Dynamics
Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De Giorgi – SNS, Pisa. December 13 – 16, 2022.

Weekend di lavoro su Calcolo delle Variazioni
Organized by Giuseppe Buttazzo, Maria Stella Gelli, and Aldo Pratelli. Grand Hotel Tettuccio Montecatini. November 25 – 27, 2022.

Incontri di geometria algebrica ed aritmetica Milano – Pisa
Department of Mathematics, Pisa. November 16 – 17, 2022.

```

We compose the following conversation with an LLM. Placeholders between `{{ ... }}` are template variables that are filled in before generation starts.

```
INPUT:

Statistical and Computational Aspects of Dynamics
Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De Giorgi – SNS, Pisa. December 13 – 16, 2022.

OUTPUT JSON:
{
  "title": "Statistical and Computational Aspects of Dynamics",
  "url": "http://www.crm.sns.it/event/507/",
  "description": "Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Location: Centro De Giorgi - SNS, Pisa.",
  "startDate": "2022-12-13",
  "endDate": "2022-12-16"
}
INPUT:
{{ conference_html }}
OUTPUT JSON:
```

The LLM then completes this conversation with a JSON representation of `{{ conference_html }}`. The first example is needed to show the model how to convert the input HTML into the target format.
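
As a rough illustration of the one-shot prompt described above, the sketch below fills in the template and extracts the JSON object from a completion. It is not taken from `main.py`: the names `build_prompt`, `parse_completion`, `EXAMPLE_INPUT`, and `EXAMPLE_OUTPUT` are hypothetical, the example input is abbreviated plain text rather than the real HTML, and the call to the model itself is left out.

```python
import json

# Worked example shown to the model (abbreviated; the real input is HTML).
EXAMPLE_INPUT = (
    "Statistical and Computational Aspects of Dynamics\n"
    "Organized by Paolo Giulietti and others. "
    "Centro De Giorgi – SNS, Pisa. December 13 – 16, 2022."
)
EXAMPLE_OUTPUT = {
    "title": "Statistical and Computational Aspects of Dynamics",
    "description": "Organized by Paolo Giulietti and others. "
                   "Location: Centro De Giorgi - SNS, Pisa.",
    "startDate": "2022-12-13",
    "endDate": "2022-12-16",
}

# The template ends right after "OUTPUT JSON:" so the model continues with JSON.
PROMPT_TEMPLATE = (
    "INPUT:\n{example_input}\n"
    "OUTPUT JSON:\n{example_output}\n"
    "INPUT:\n{conference_html}\n"
    "OUTPUT JSON:\n"
)


def build_prompt(conference_html: str) -> str:
    """Fill the template with the fixed example and the new conference."""
    return PROMPT_TEMPLATE.format(
        example_input=EXAMPLE_INPUT,
        example_output=json.dumps(EXAMPLE_OUTPUT, ensure_ascii=False, indent=2),
        conference_html=conference_html,
    )


def parse_completion(completion: str) -> dict:
    """Decode the first JSON object in the completion, ignoring trailing text."""
    start = completion.index("{")  # raises ValueError if no object is present
    obj, _end = json.JSONDecoder().raw_decode(completion, start)
    return obj


if __name__ == "__main__":
    prompt = build_prompt("Department of Mathematics, Pisa. November 16 – 17, 2022.")
    # A made-up completion standing in for the model's answer.
    fake_completion = '{"startDate": "2022-11-16", "endDate": "2022-11-17"}\nINPUT:'
    print(json.dumps(parse_completion(fake_completion)))
```

Decoding only the first JSON object is one way to cope with a model that keeps generating past the closing brace (for instance, by starting another `INPUT:` block); a stop sequence on the generation call would be an alternative.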