# Past Conferences Crawler (with LLM)

A Python script that crawls conferences from <https://www.dm.unipi.it/research/past-conferences/> and processes them with a locally run LLM (we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)) to translate the natural-language information into a more structured format (JSON).

## Installation

Download the LLM model. We use [`Mistral-7B-Instruct-v0.2-GGUF`](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF):

```bash
# download the model, ~4 GB
# note: use /resolve/main/ in the URL; /blob/main/ returns the HTML file page, not the model
$ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```
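
As a quick sanity check after the download, the sketch below tests whether a file starts with the four magic bytes `GGUF` that open every GGUF file. The helper name `looks_like_gguf` is hypothetical and not part of this repository; it only helps catch the case where an HTML page was saved instead of the model.

```python
GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    import sys

    # Usage: python check_gguf.py mistral-7b-instruct-v0.2.Q4_K_M.gguf
    for model_path in sys.argv[1:]:
        status = "ok" if looks_like_gguf(model_path) else "NOT a GGUF file"
        print(f"{model_path}: {status}")
```

This only inspects the header, so it does not prove the download is complete.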

Install the Python requirements:

```bash
# optional: create and activate a virtual environment
$ python -m venv venv
$ source venv/bin/activate

# build llama-cpp with OpenBLAS acceleration (a CPU BLAS backend; these flags do not enable GPU offload)
$ export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

# install requirements
$ pip install -r requirements.txt
```

## Launch

The following command crawls all the conferences from `https://www.dm.unipi.it/research/past-conferences/` (all pages) and saves the results in `conferences.json`, one JSON object per line.

```bash
$ python main.py
```

Because the file holds one JSON object per line rather than a single JSON document, use `jq -s` to read all the objects into one array and display them:

```bash
$ jq -s '.' conferences.json
```
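
The same file can be read from Python by parsing each line on its own. This is a minimal sketch, assuming the default output name `conferences.json`; the function name `load_conferences` is hypothetical and not part of `main.py`.

```python
import json


def load_conferences(path: str = "conferences.json") -> list[dict]:
    """Read a file that holds one JSON object per line into a list of dicts."""
    conferences = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                conferences.append(json.loads(line))
    return conferences
```

Calling `json.load` on the whole file would fail, because several objects in a row do not form one valid JSON document.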

## Main idea / Explanation

We need to parse strings like the following:

```html
<p><a href="http://www.crm.sns.it/event/507/" target="_blank" rel="noreferrer
noopener">Statistical and Computational Aspects of Dynamics<br></a>Organized by Buddhima
Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS),
Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De
Giorgi – SNS, Pisa. December 13 – 16, 2022.</p>

<p><a href="https://events.dm.unipi.it/event/126/" target="_blank" rel="noreferrer
noopener">Weekend di lavoro su Calcolo delle Variazioni<br></a>Organized by Giuseppe
Buttazzo, Maria Stella Gelli, and Aldo Pratelli. Grand Hotel Tettuccio Montecatini.
November 25 – 27, 2022.</p>

<p><a href="https://events.dm.unipi.it/event/109/" target="_blank" rel="noreferrer
noopener">Incontri di geometria algebrica ed aritmetica Milano – Pisa<br></a>Department
of Mathematics, Pisa. November 16 – 17, 2022.</p>
```

We compose the following conversation with the LLM. Anything between `{{ ... }}` is a template placeholder that is replaced before generation starts.

```
INPUT:
<p><a href="http://www.crm.sns.it/event/507/" target="_blank" rel="noreferrer
noopener">Statistical and Computational Aspects of Dynamics<br></a>Organized by Buddhima
Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS),
Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De
Giorgi – SNS, Pisa. December 13 – 16, 2022.</p>

OUTPUT JSON:
{
"title": "Statistical and Computational Aspects of Dynamics",
"url": "http://www.crm.sns.it/event/507/",
"description": "Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca
matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle
Schindler (Universität Wien, Austria). Location: Centro De Giorgi - SNS, Pisa.",
"startDate": "2022-12-13",
"endDate": "2022-12-16"
}

INPUT:
{{ conference_html }}

OUTPUT JSON:
```

The LLM then completes this conversation with a JSON representation of `{{ conference_html }}`. The first INPUT/OUTPUT pair is a worked example that shows the model how to convert the information in the input HTML into the target fields.
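
The fill-and-parse step around the model call can be sketched as follows. This is an illustration under stated assumptions, not the code in `main.py`: the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_completion` are hypothetical, the template is a shortened stand-in for the full prompt shown above, and the model call itself is left out.

```python
import json
from datetime import date

# Shortened stand-in for the few-shot prompt; the real template would also
# carry the worked INPUT / OUTPUT JSON example before the final INPUT.
PROMPT_TEMPLATE = "INPUT:\n{{ conference_html }}\n\nOUTPUT JSON:\n"

# Field names taken from the example output in the prompt.
REQUIRED_KEYS = ("title", "url", "description", "startDate", "endDate")


def build_prompt(conference_html: str, template: str = PROMPT_TEMPLATE) -> str:
    """Replace the {{ conference_html }} placeholder with the real snippet."""
    return template.replace("{{ conference_html }}", conference_html)


def parse_completion(completion: str) -> dict:
    """Extract and validate the first JSON object found in the model output."""
    start = completion.find("{")
    if start == -1:
        raise ValueError("no JSON object found in completion")
    # strict=False tolerates raw newlines inside strings; raw_decode stops at
    # the end of the first object and ignores anything the model adds after it.
    decoder = json.JSONDecoder(strict=False)
    obj, _end = decoder.raw_decode(completion[start:])
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # Raises ValueError if a date is not in YYYY-MM-DD form.
    date.fromisoformat(obj["startDate"])
    date.fromisoformat(obj["endDate"])
    return obj
```

Validating the keys and dates matters here because the model's output is free text: a completion that looks like JSON may still omit a field or keep the date in its original wording. Passing a stop sequence such as `INPUT:` to the generation call is one way to keep the model from inventing further examples after the first object.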