diff --git a/.gitignore b/.gitignore
index 83acb35..652576f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,9 @@
 # Local files
 *.local*
+# Output files
+*.json
+
 # Python
 venv/
diff --git a/README.md b/README.md
index 676fd0b..90d8abf 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ $ pip install -r requirements.txt
 ## Launch
-The following command will crawl the conferences from `https://www.dm.unipi.it/research/past-conferences/` (pages 1 to 5) and save the results in `conferences.json`:
+The following command will crawl all the conferences from `https://www.dm.unipi.it/research/past-conferences/` (all pages) and save the results in `conferences.json` as a list of json objects, one per line.
 ```bash
 $ python main.py
diff --git a/main.py b/main.py
index d71f464..daba610 100755
--- a/main.py
+++ b/main.py
@@ -6,6 +6,8 @@ from bs4 import BeautifulSoup
 import textwrap
 import json
+OUTPUT_FILE = "conferences.json"
+
 LLM_EXAMPLE = (
     "INPUT:\n"
     '

Statistical'
@@ -99,10 +101,10 @@ llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf", chat_format="ll
 # clear the result file
-open("conferences.json", "w").close()
+open(OUTPUT_FILE, "w").close()
 # the result file is a sequence of json objects, one per line
-results_file = open("conferences.json", "a")
+results_file = open(OUTPUT_FILE, "a")
 for conference_html in conference_html_snippets:
     print("Translating:")