# Past Conferences Crawler (with LLM)

A Python script that crawls conferences from `https://www.dm.unipi.it/research/past-conferences/` and processes them using a locally run LLM (we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)) to translate the natural-language information into a more structured format (JSON).

## Installation

Download the LLM model to use. Specifically, we use [`Mistral-7B-Instruct-v0.2-GGUF`](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF):

```bash
# download the model, ~4GB
# (use the /resolve/ URL: the /blob/ URL returns the HTML page, not the model file)
$ wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```

Install the Python requirements:

```bash
# optional: create and activate a venv
$ python -m venv venv
$ source venv/bin/activate

# build llama-cpp with OpenBLAS acceleration (OpenBLAS is a CPU BLAS backend, not GPU)
$ export CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

# install requirements
$ pip install -r requirements.txt
```

## Launch

The following command crawls all the conferences from `https://www.dm.unipi.it/research/past-conferences/` (all pages) and saves the results in `conferences.json`, one JSON object per line.

```bash
$ python main.py
```

Because the file contains one JSON object per line rather than a single JSON array, use `jq -s` to collect the objects into one array and display them:

```bash
$ jq -s '.' conferences.json
```
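
To work with the results from Python instead of `jq`, the file has to be read line by line, since it holds one JSON object per line and `json.load` on the whole file would fail. The following is a minimal sketch using only the standard library; the helper name `load_conferences` is hypothetical and not part of this repository, and no assumption is made about which fields each object contains.

```python
import json
from pathlib import Path


def load_conferences(path="conferences.json"):
    """Read a file with one JSON object per line and return them as a list.

    Hypothetical helper for illustration; it is not part of main.py.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between objects
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                # Report which line is malformed instead of failing silently
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({error})"
                ) from error
    return records
```

Calling `load_conferences()` after `python main.py` has finished should return a list of dictionaries, the Python equivalent of the array produced by `jq -s '.'`.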