diff --git a/README.md b/README.md
index 90d8abf..8d12cca 100644
--- a/README.md
+++ b/README.md
@@ -37,4 +37,48 @@ The output is a list of json objects, one per line. To display the results with
 ```bash
 $ jq -s '.' conferences.json
-```
\ No newline at end of file
+```
+
+## Main idea / Explanation
+
+We need to parse strings like the following:
+
+```html
+

Statistical and Computational Aspects of Dynamics
Organized by Buddhima
+Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS),
+Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De
+Giorgi – SNS, Pisa. December 13 – 16, 2022.

+ +

Weekend di lavoro su Calcolo delle Variazioni
Organized by Giuseppe
+Buttazzo, Maria Stella Gelli, and Aldo Pratelli. Grand Hotel Tettuccio Montecatini.
+November 25 – 27, 2022.

+ +

Incontri di geometria algebrica ed aritmetica Milano – Pisa
Department
+of Mathematics, Pisa. November 16 – 17, 2022.

+```
+
+We compose the following conversation with an LLM; anything between `{{ ... }}` is a template placeholder that is replaced before generation starts.
+
+```
+INPUT:
+

Statistical and Computational Aspects of Dynamics
Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De Giorgi – SNS, Pisa. December 13 – 16, 2022.

+
+OUTPUT:
+{
+  "title": "Statistical and Computational Aspects of Dynamics",
+  "url": "http://www.crm.sns.it/event/507/",
+  "description": "Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Location: Centro De Giorgi - SNS, Pisa.",
+  "startDate": "2022-12-13",
+  "endDate": "2022-12-16"
+}
+
+INPUT:
+{{ conference_html }}
+
+OUTPUT:
+```
+
+The LLM then completes this conversation with a JSON representation of `{{ conference_html }}`. The first INPUT/OUTPUT pair is a worked example that shows the model how to map the input HTML to the output fields.
\ No newline at end of file
diff --git a/main.py b/main.py
index 46e8880..556c2eb 100755
--- a/main.py
+++ b/main.py
@@ -5,13 +5,21 @@ from bs4 import BeautifulSoup
 import requests
 import json
+
 OUTPUT_FILE = "conferences.json"
 HTML_EXAMPLE = r"""

Statistical and Computational Aspects of Dynamics
Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Centro De Giorgi – SNS, Pisa. December 13 – 16, 2022.

 """
-OUTPUT_EXAMPLE = json.dumps(
-    { "title": "Statistical and Computational Aspects of Dynamics", "url": "http://www.crm.sns.it/event/507/", "description": "Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Location: Centro De Giorgi - SNS, Pisa.", "startDate": "2022-12-13", "endDate": "2022-12-16" }
-)
+
+
+OUTPUT_EXAMPLE = json.dumps({
+    "title": "Statistical and Computational Aspects of Dynamics",
+    "url": "http://www.crm.sns.it/event/507/",
+    "description": "Organized by Buddhima Kasun Fernando Akurugodage (Centro di ricerca matematica Ennio De Giorgi – SNS), Paolo Giulietti, and Tanja Isabelle Schindler (Universität Wien, Austria). Location: Centro De Giorgi - SNS, Pisa.",
+    "startDate": "2022-12-13",
+    "endDate": "2022-12-16"
+})
+
 def translate_to_json(conference_html: str) -> str:
     llm_answer = llm.create_chat_completion(