# Past Conferences Crawler (with LLM)
A Python script that crawls conferences from <https://www.dm.unipi.it/research/past-conferences/> and processes them with a locally run LLM (we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)) to convert the natural-language information into a more structured format (JSON).
## Installation
Download the LLM to use. We use [`Mistral-7B-Instruct-v0.2-GGUF`](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF):
```bash
# download the model, ~4GB
$ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```
Install the Python requirements:
```bash
# optionally, create a venv
$ python -m venv venv
$ source venv/bin/activate
# enable OpenBLAS acceleration for llama-cpp (a CPU BLAS backend, not GPU offloading)
$ export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
# install requirements
$ pip install -r requirements.txt
```
## Launch
The following command crawls all the conferences from `https://www.dm.unipi.it/research/past-conferences/` (all pages) and saves the results in `conferences.json`, one JSON object per line.
```bash
$ python main.py
```
Because the file holds one JSON object per line, pass `-s` (slurp) to `jq` to display the results as a single array:
```bash
$ jq -s '.' conferences.json
```
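The same file can also be read from Python. This is a minimal sketch, assuming `conferences.json` holds one JSON object per line as described above; the helper name `load_conferences` is hypothetical and not part of this repository.

```python
import json


def load_conferences(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    conferences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                conferences.append(json.loads(line))
    return conferences
```

For example, `load_conferences("conferences.json")` returns a list of dictionaries, one per conference.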
## Main idea / Explanation
We need to parse strings like the following:
```html
<p><a href="http://www.crm.sns.it/event/507/" target="_blank" rel="noreferrer
noopener">Statistical and Computational Aspects of Dynamics<br></a>Organized by Buddhima
Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi &#8211; SNS),
Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De
Giorgi &#8211; SNS, Pisa. December 13 &#8211; 16, 2022.</p>
<p><a href="https://events.dm.unipi.it/event/126/" target="_blank" rel="noreferrer
noopener">Weekend di lavoro su Calcolo delle Variazioni<br></a>Organized by Giuseppe
Buttazzo, Maria Stella Gelli, and Aldo Pratelli. Grand Hotel Tettuccio Montecatini.
November 25 &#8211; 27, 2022.</p>
<p><a href="https://events.dm.unipi.it/event/109/" target="_blank" rel="noreferrer
noopener">Incontri di geometria algebrica ed aritmetica Milano Pisa<br></a>Department
of Mathematics, Pisa. November 16 &#8211; 17, 2022.</p>
```
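To see why plain HTML parsing is not enough here, consider the sketch below. It is not what the script does; it is only an illustration using the standard library's `html.parser`, and the names `ConferenceParagraph` and `split_paragraph` are hypothetical. The URL and the title can be extracted mechanically, but organizers, location, and dates remain mixed together in one free-text string.

```python
from html.parser import HTMLParser


class ConferenceParagraph(HTMLParser):
    """Collect the first link's URL, the link text, and the text outside the link."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8211; into characters
        super().__init__(convert_charrefs=True)
        self.url = None
        self.title_parts = []
        self.rest_parts = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.url is None:
            self.url = dict(attrs).get("href")
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        target = self.title_parts if self._in_link else self.rest_parts
        target.append(data)


def split_paragraph(html):
    """Return (url, title, free_text) for one <p>...</p> conference entry."""
    parser = ConferenceParagraph()
    parser.feed(html)
    parser.close()
    # Collapse the line breaks and repeated spaces left over from the HTML source
    title = " ".join("".join(parser.title_parts).split())
    rest = " ".join("".join(parser.rest_parts).split())
    return parser.url, title, rest
```

The third element of the result still needs to be split into a description and two ISO dates, which is the part delegated to the LLM.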
We compose the following conversation with an LLM. Anything between `{{ ... }}` is a template placeholder that is replaced before generation starts.
```
INPUT:
<p><a href="http://www.crm.sns.it/event/507/" target="_blank" rel="noreferrer
noopener">Statistical and Computational Aspects of Dynamics<br></a>Organized by Buddhima
Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi &#8211; SNS),
Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De
Giorgi &#8211; SNS, Pisa. December 13 &#8211; 16, 2022.</p>
OUTPUT JSON:
{
"title": "Statistical and Computational Aspects of Dynamics",
"url": "http://www.crm.sns.it/event/507/",
"description": "Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca
matematica Ennio De Giorgi SNS), Paolo Giulietti, and Tanja Isabelle
Schindler (Universität Wien, Austria). Location: Centro De Giorgi - SNS, Pisa.",
"startDate": "2022-12-13",
"endDate": "2022-12-16"
}
INPUT:
{{ conference_html }}
OUTPUT JSON:
```
The LLM then completes this conversation with a JSON representation of `{{ conference_html }}`. The first example shows the model how to convert the information in the input HTML.
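
The one-shot prompt above could be assembled and its completion checked roughly as follows. This is a sketch under stated assumptions, not the code in `main.py`: the names `PROMPT_TEMPLATE`, `build_prompt`, `parse_completion`, and `extract_conference` are hypothetical, and `complete` stands for any callable that sends a prompt to the model and returns the generated text.

```python
import json
from datetime import date

# Same layout as the conversation above: one worked example, then the new input.
PROMPT_TEMPLATE = (
    "INPUT:\n{example_html}\n"
    "OUTPUT JSON:\n{example_json}\n"
    "INPUT:\n{conference_html}\n"
    "OUTPUT JSON:\n"
)

# Keys taken from the example output shown above.
REQUIRED_KEYS = ("title", "url", "description", "startDate", "endDate")


def build_prompt(example_html, example_json, conference_html):
    """Fill the template with the worked example and the entry to convert."""
    return PROMPT_TEMPLATE.format(
        example_html=example_html,
        example_json=example_json,
        conference_html=conference_html,
    )


def parse_completion(text):
    """Decode the first JSON object in the completion and sanity-check it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("completion contains no JSON object")
    # raw_decode stops at the end of the object and ignores trailing text
    obj, _ = json.JSONDecoder().raw_decode(text, start)
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # Raises ValueError if a date is not in YYYY-MM-DD form
    date.fromisoformat(obj["startDate"])
    date.fromisoformat(obj["endDate"])
    return obj


def extract_conference(conference_html, complete, example_html, example_json):
    """Build the prompt, ask the model, and return the parsed JSON object."""
    prompt = build_prompt(example_html, example_json, conference_html)
    return parse_completion(complete(prompt))
```

With `llama-cpp-python`, `complete` might wrap a `Llama(model_path=...)` instance, for example `lambda p: llm(p, max_tokens=512, stop=["INPUT:"])["choices"][0]["text"]`; the `max_tokens` value and the stop string are assumptions chosen for illustration. Validating the keys and dates guards against completions that are well-formed JSON but do not follow the example.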