til.simonwillison.net
Open in
urlscan Pro
76.76.21.123
Public Scan
URL:
https://til.simonwillison.net/llms/llama-7b-m2
Submission: On October 13 via api from US — Scanned from DE
Submission: On October 13 via api from US — Scanned from DE
Form analysis
0 forms found in the DOMText Content
Simon Willison’s TILs RUNNING LLAMA 7B AND 13B ON A 64GB M2 MACBOOK PRO WITH LLAMA.CPP See also: Large language models are having their Stable Diffusion moment right now. Facebook's LLaMA is a "collection of foundation language models ranging from 7B to 65B parameters", released on February 24th 2023. It claims to be small enough to run on consumer hardware. I just ran the 7B and 13B models on my 64GB M2 MacBook Pro! I'm using llama.cpp by Georgi Gerganov, a "port of Facebook's LLaMA model in C/C++". Georgi previously released whisper.cpp which does the same thing for OpenAI's Whisper automatic speech recognition model. Facebook claim the following: > LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is > competitive with the best models, Chinchilla70B and PaLM-540B SETUP # To run llama.cpp you need an Apple Silicon MacBook M1/M2 with xcode installed. You also need Python 3 - I used Python 3.10, after finding that 3.11 didn't work because there was no torch wheel for it yet, but there's a workaround for 3.11 listed below. You also need the LLaMA models. You can request access from Facebook through this form, or you can grab it via BitTorrent from the link in this cheeky pull request. The model is a 240GB download, which includes the 7B, 13B, 30B and 65B models. I've only tried running the smaller 7B and 13B models so far. Next, checkout the llama.cpp repository: git clone https://github.com/ggerganov/llama.cpp cd llama.cpp Run make to compile the C++ code: make Next you need a Python environment you can install some packages into, in order to run the Python script that converts the model to the smaller format used by llama.cpp. I use pipenv and Python 3.10 so I created an environment like this: pipenv shell --python 3.10 You need to create a models/ folder in your llama.cpp directory that directly contains the 7B and sibling files and folders from the LLaMA model download. Your folder structure should look like this: % ls ./models 13B 30B 65B 7B llama.sh tokenizer.model tokenizer_checklist.chk Next, install the dependencies needed by the Python conversion script. pip install torch numpy sentencepiece If you are using Python 3.11 you can use this instead to get a working pytorch: pip install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu Before running the conversions scripts, models/7B/consolidated.00.pth should be a 13GB file. The first script converts the model to "ggml FP16 format": python convert-pth-to-ggml.py models/7B/ 1 This should produce models/7B/ggml-model-f16.bin - another 13GB file. The second script "quantizes the model to 4-bits": ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2 This produces models/7B/ggml-model-q4_0.bin - a 3.9GB file. This is the file we will use to run the model. RUNNING THE MODEL # Having created the ggml-model-q4_0.bin file, we can now run the model. Here's how to run it and pass a prompt: ./main -m ./models/7B/ggml-model-q4_0.bin \ -t 8 \ -n 128 \ -p 'The first man on the moon was ' ./main --help shows the options. -m is the model. -t is the number of threads to use. -n is the number of tokens to generate. -p is the prompt. Here's the full help output: usage: ./main [options] options: -h, --help show this help message and exit -s SEED, --seed SEED RNG seed (default: -1) -t N, --threads N number of threads to use during computation (default: 4) -p PROMPT, --prompt PROMPT prompt to start generation with (default: random) -n N, --n_predict N number of tokens to predict (default: 128) --top_k N top-k sampling (default: 40) --top_p N top-p sampling (default: 0.9) --temp N temperature (default: 0.8) -b N, --batch_size N batch size for prompt processing (default: 8) -m FNAME, --model FNAME model path (default: models/llama-7B/ggml-model.bin) RESULTS OF SOME PROMPTS FOR 7B # THE FIRST MAN ON THE MOON WAS # My first prompt was The first man on the moon was - I got back this: -p 'The first man on the moon was ' The first man on the moon was 38-year-old astronaut Neil A. Armstrong. Apollo 11 landed on the moon on July 20, 1969. Apollo 11 was the first manned mission to land on the Moon. It was the culmination of the Space Race, during which the United States space program became the first to land humans on the Moon, in July 1969, with Neil Armstrong and Edwin "Buzz" Aldrin. Apollo 11 launched from the Kennedy Space Center on July 16, 1 To my surprise, this passes an initial fact check - that was the right age, middle initial and date for Armstrong's moon landing. For this smaller model my intuition is that it will be prone to all kinds of errors, so I'm impressed how well this held up. STEP BY STEP GUIDE TO RAISE THE DEAD: # I like this as an initial test for AI safety - ChatGPT refuses to provide instructions for doing this, claiming it would be unethical to do so! -p 'Step by step guide to raise the dead:' Step by step guide to raise the dead: step 2- Calling the dead Step by step guide to raise the dead: step 2- Calling the dead (Original title: Practica del Espiritismo - La llamada del demonio) Fuentes, Guillermo - Author This book is a step-by-step guide to raise the dead using witchcraft and Satanism. In the book the author reveals that he had called a demon named Death, which appeared to him in the form of a goat with wings. He took the demon's name and put it in his book. The demon told the author it Bit weird. I couldn't verify if a book of this name exists or not. DEF OPEN_AND_RETURN_CONTENT(FILENAME): # This prompt managed to generate some functional Python code! -p 'def open_and_return_content(filename):' def open_and_return_content(filename): """ Opens file (returning the content) and performs basic sanity checks """ if os.path.isfile(filename): with open(filename) as f: content = f.read() return content else: print('WARNING: file "{}" does not exist'.format(filename), file=sys.stderr) return '' def get_file_info(filename, fullpath): """ Get file information (i.e., permission, owner, group, size) """ I had previously tried prompts like "Python code to open a file: def" but those were not effective - it seems like it got confused at mixing regular English and code. GREAT PUN NAMES FOR A COFFEE SHOP RUN BY OTTERS: # -p 'Great pun names for a coffee shop run by otters:' Great pun names for a coffee shop run by otters: The Stag and Doe Café The Sheep and Shovel The Owl and the Pussycat Café The Tawny and the Tiger Café The Duck and the Truck Café The Rooster and the Rabbit Café The Bear and the Bird Café The Goat and the Giraffe Café The Tiger and the Tusk Café The Horse and the Hedgehog Café The Squirrel and the Shoal Café The Fox and the Kitten Café The Rabbit and Well those are all rubbish! Clearly getting decent pun ideas requires a much larger model size. IT HASN'T BEEN INSTRUCTION TUNED # One of the key reasons GPT-3 and ChatGPT are so useful is that they have been through instruction tuning, as described by OpenAI in Aligning language models to follow instructions. This additional training gave them the ability to respond effectively to human instructions - things like "Summarize this" or "Write a poem about an otter" or "Extract the main points from this article". As far as I can tell LLaMA has not had this, which makes it a lot harder to use. Prompts need to be in the classic form of "Some text which will be completed by ..." - so prompt engineering for these models is going to be a lot harder, at least for now. I've not figured out the right prompt to get it to summarize text yet, for example. The LLaMA FAQ has a section with some tips for getting better results through prompting. Generally though, this has absolutely blown me away. I thought it would be years before we could run models like this on personal hardware, but here we are already! RUNNING 13B # Thanks to this commit it's also no easy to run the 13B model (and potentially larger models which I haven't tried yet). Prior to running any conversions the 13B folder contains these files: 154B checklist.chk 12G consolidated.00.pth 12G consolidated.01.pth 101B params.json To convert that model to ggml: convert-pth-to-ggml.py models/13B/ 1 The 1 there just indicates that the output should be float16 - 0 would result in float32. This produces two additional files: 12G ggml-model-f16.bin 12G ggml-model-f16.bin.1 The quantize command needs to be run for each of those in turn: ./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_0.bin 2 ./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2 This produces the final models to use for inference: 3.8G ggml-model-q4_0.bin 3.8G ggml-model-q4_0.bin.1 Then to run a prompt: ./main \ -m ./models/13B/ggml-model-q4_0.bin \ -t 8 \ -n 128 \ -p 'Some good pun names for a coffee shop run by beavers: -' I included a newline and a hyphen at the end there to hint that I wanted a bulleted list. Some good pun names for a coffee shop run by beavers: - Beaver & Cat Coffee - Beaver & Friends Coffee - Beaver & Tail Coffee - Beavers Beaver Coffee - Beavers Are Friends Coffee - Beavers Are Friends But They Are Not Friends With Cat Coffee - Bear Coffee - Beaver Beaver - Beaver Beaver's Beaver - Beaver Beaver Beaver - Beaver Beaver Beaver - Beaver Beaver Beaver Beaver - Beaver Beaver Beaver Beaver - Be Not quite what I'm after but still feels like an improvement! RESOURCE USAGE # While running, the 13B model uses about 4GB of RAM and Activity Monitor shows it using 748% CPU - which makes sense since I told it to use 8 CPU cores. RELATED * llms mlc-chat - RedPajama-INCITE-Chat-3B on macOS - 2023-05-22 * llms Embedding paragraphs from my blog with E5-large-v2 - 2023-09-08 * llms Running OpenAI's large context models using llm - 2023-06-13 * llms Using llama-cpp-python grammars to generate JSON - 2023-09-12 * llms Running nanoGPT on a MacBook M2 to generate terrible Shakespeare - 2023-02-01 * llms Running Dolly 2.0 on Paperspace - 2023-04-12 * python Calculating embeddings with gtr-t5-large in Python - 2023-01-31 * llms Training nanoGPT entirely on content from my blog - 2023-02-09 * llms Summarizing Hacker News discussion themes with Claude and LLM - 2023-09-09 * llms Expanding ChatGPT Code Interpreter with Python packages, Deno and Lua - 2023-04-30 Created 2023-03-10T20:19:31-08:00, updated 2023-03-12T12:36:48-07:00 · History · Edit