How is this different than ollama (https://github.com/jmorganca/ollama)? I would argue it's even simpler to run LLMs locally with ollama
Ollama's approach with Docker is indeed user-friendly, but Rust + WebAssembly (Wasm) offers some cool perks worth considering. With Wasm, you get true cross-platform support, meaning it runs smoothly on various devices and automagically uses the local hardware accelerator to run at full speed. It's also super lightweight, making it faster and more resource-efficient, a big plus for running LLMs, especially on devices with limited resources. A Wasm inference application can be as small as 2MB, including all dependencies, which is significantly smaller than most Docker containers.
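To make the "compile once, run anywhere" claim concrete, here is roughly what the workflow looks like (commands are illustrative; the app name is a placeholder and exact runtime flags vary):

```
# Build the Rust app once to the wasm32-wasi target...
rustup target add wasm32-wasi
cargo build --target wasm32-wasi --release

# ...then run the very same .wasm file on any machine with a Wasm
# runtime (e.g. WasmEdge), regardless of CPU architecture or OS:
wasmedge target/wasm32-wasi/release/inference_app.wasm
```

The point is that the .wasm binary itself is architecture-neutral; it's the runtime on each machine that maps it onto the local hardware.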
Mistral has created a Docker image that hosts their model in vLLM. vLLM exposes an OpenAI-like HTTP API.
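For example, assuming the vLLM OpenAI-compatible server is running locally on its default port (the port and model name here are illustrative), you can query it much like you would the OpenAI API:

```
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "prompt": "Explain WebAssembly in one sentence.",
        "max_tokens": 64
      }'
```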
While Mistral's Docker image provides a practical solution, it's important to note that Docker isn't truly cross-platform. A Docker image compiled for x86 architecture won't run on ARM, which can be a limitation for diverse hardware environments. On the other hand, Rust + WebAssembly (Wasm) offers genuine cross-platform capabilities. Wasm is not only faster to start and more resource-efficient but also more secure. This makes it exceptionally suitable for handling large language models, ensuring high performance while being cost-effective. In scenarios where diverse hardware support and efficiency are crucial, Rust + Wasm emerges as a superior choice over Docker.
Does anyone have any feedback on using those open-source models with any language except English? Particularly non-Western languages like Korean/Japanese/Chinese?
Will assess myself but wonder if anyone tried.
For Chinese, there is the Yi model by 01.AI:
Yi 34B https://www.secondstate.io/articles/yi-34b/
Uncensored Yi: https://www.secondstate.io/articles/dolphin-2.2-yi-34b/
and also Baichuan https://www.secondstate.io/articles/baichuan2-13b-chat/
There is an Arabic one called Jais, mentioned by Satya a few days ago at the Microsoft dev day. I would like to try it out.
A bit of a hit and miss really. I haven't fiddled with languages all that much, but I was kinda curious to try a bunch of models with Spanish (the only large language other than English in my toolbox) about a month or two ago. Aguila 7B was the one that really stood out, but then again it shouldn't come as a surprise: it is practically Falcon 7B trained on Spanish data.
It's probably better to switch to suitable language-specific models. GPTs sound translated in non-native languages, and also have difficulty separating CJK due to that Unicode quirk.
They work, but they need way more tokens to express themselves. Still, it beats paying OpenAI for the privilege.
Alibaba's Qwen is excellent for Chinese.
What are Mistral's strengths and weaknesses? I tried it for infrastructure as code and it wasn't able to output more than the most basic examples, let alone modify them.
Mistral's main strength is that it is a good all-rounder model at 7B size, so it runs on a lot of consumer hardware. If you are looking for specific tasks to run locally (with decent RAM and no GPU), the realistic best option in most categories is a fine-tuned 13B model.
I've tried using LLMs in general with IaC (K8s resources/Helm charts), and they all did well when asking a very specific question about it (e.g. "Is there an alternative way to accomplish X?"), but I never had them perform well when outputting YAML directly (either from scratch or modifying existing YAML).
Which did you try? Many of these models don't work well at all for queries like "please do this task for me". But then people fine tune them and they work much better.
Give zephyr a try (available in ollama and similar places)
It's a fine tune of mistral and works quite well.
But as you point out, these models have less general knowledge compared to their massive siblings. Knowledge based queries are going to be lower quality than using gpt-4.
Excellent at creative writing
Pretty good at instructions
Uncensored, no RLHF bs
I never heard of IaaC before.
Thanks for sharing this!
I read somewhere that Mistral 7B had a similar performance to GPT-3, but it seems to be miles behind it unfortunately.
I tried this fine tune of it https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B and was impressed. It seems to be one of the better performing models on the LLM leaderboard https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
Remember GPT-3 != ChatGPT. I could perhaps believe it is similar to GPT-3, but it's certainly miles behind ChatGPT even with 3.5
llama.cpp can run the Q4 variant of the same model at 30 tok/s on an M1 Pro, unlike the 20 tok/s being quoted.
I get similar speeds on my m1, though it depends a bit on the quantization. If anyone else is on Mac and just want to have a play, FreeChat (my app) is a simple chat wrapper for llama.cpp. Clicking the little … on a message and checking the info will tell you the generation speed.
I really like the post that they mention (https://www.secondstate.io/articles/fast-llm-inference/). The reasons for avoiding Python all resonate with me. I'm excited to play with WASI-NN (https://github.com/WebAssembly/wasi-nn), and the Rust code for loading up a GGUF model is very readable.
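For anyone curious, the flow in Rust looks roughly like this. This is a sketch based on the WasmEdge GGML examples, assuming the wasmedge_wasi_nn crate and a model preloaded by the runtime under the alias "default" (via its --nn-preload option), so treat names and signatures as approximate:

```rust
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a GGUF model that the WasmEdge runtime has preloaded
    // under the alias "default".
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")?;
    let mut ctx = graph.init_execution_context()?;

    // Feed the prompt as a UTF-8 byte tensor, run inference, read the output.
    let prompt = "What is WebAssembly?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())?;
    ctx.compute()?;

    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out)?;
    println!("{}", String::from_utf8_lossy(&out[..n]));
    Ok(())
}
```

Note that the model loading, hardware selection, and quantization handling all live behind the wasi-nn interface, which is why the application code stays this small.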
Related discussion: https://news.ycombinator.com/item?id=38246668
It seems a lot slower than ollama / LM Studio. Is that likely because it’s not optimised as much or a limitation of some sort?
The demo looks too slow for practical usage. How much will it cost if I host it in the cloud to get instant responses similar to the speed of OpenAI?
The demos are as fast as ChatGPT; they just look slower. First, it might seem slow initially because it's loading the model, but once that's done, it gets much quicker for any follow-up requests. Second, turning on streaming output helps a lot, as it shows responses as they're generated. And yes, the hardware matters too: a good GPU setup in the cloud can work wonders, though it might bump up the cost a bit. So the demo's actually not slower than ChatGPT overall; it just takes a moment to warm up at the start.
https://github.com/second-state/WasmEdge-WASINN-examples/tre... These show the speeds on different hardware. You can rent a GPU, which will be much faster.
Impressive use of wasmedge, great to see ML projects getting away from python for efficiency's sake, thanks for sharing!
The WasmEdge README gives me the heebie-jeebies. Starry-eyed emojis, highlights use-case for today's most trendy thing even though it's general-purpose, mentions blockchain. This reeks of former cryptobros chasing the next big thing. I'd trust Wasmtime more.
You do realize HuggingFace is named after an emoji?
> With Wasm, you get true cross-platform support, meaning it runs smoothly on various devices, automagically uses the local hardware accelerator to run at full speed.
No... no, that is not what Wasm means at all.
> you get true cross-platform support
wasi_nn is just an abstraction layer for wrappers around llama.cpp/ggml/ONNX/others. It is not any more cross-platform than those libraries/apps deployed themselves, or Rust apps using bindings for them.
Ollama doesn't use Docker by default. While it is one of the install/download options, the default mode of operation is installing it as a native application (where they provide prebuilt binaries for the main platforms/architectures) into which you can easily load models from HuggingFace repos or local GGUF files.
I don't understand. When you run `ollama pull` (or even `ollama run` which will pull if you don't have the model), that is using docker is it not? The code even references pulling from a registry.
Is there any data to back up the "faster to start and more resource-efficient", especially if you compare it to the native non-Python solutions that most people are using to run LLMs on local machines?
I'm as big of a fan of Rust and WASM as the next person, but throwing around claims like that without benchmarks is one of the quickest ways to get your product dismissed.
Faster is compared to Python. Portable, more secure, and lightweight are compared with Python and other native solutions. In terms of benchmarks: Rust/C++ can be up to 50,000x faster than Python; the WasmEdge runtime plus a portable app is about 30MB, compared with a 4GB Python image or a 300MB llama.cpp Docker image that is NOT portable across CPUs or GPUs; and the Wasm sandbox is more secure than a native binary.
Usually when you’re building container images at any scale you’d use something like buildx, kaniko or buildah which allow you to easily set multiple target architectures (AMD64, ARM64) for your images.
There’s hardly any overhead to running an application in a container, if that’s what you mean by overhead. If you mean the image size: you only need to add the things you need. The problem is that people tend to abuse images and install a lot of packages in the final image which absolutely aren’t required.
Rust or golang at the core would be nice as Python can be slow at times. I do wish more folks would give Tauri a go for GUI apps.
Yi has weird license
emmm do you mean not truly open source
For Qwen there is the CausalLM 14B model, based on the llama2 architecture but with Qwen 14B model weights.
Oh thanks, I didn't know about that. Will check it out!
I tried the basic Mistral Instruct model. I'll give Zephyr a try, thanks for the recommendation!
Tell me how can I run it locally.
Ashu, I am going to guess that English is not your first language (apologies if that is not the case), but "Tell me how can I run it locally." is not a great way to ask for info, not here at least. A 'please' would certainly help, but a less direct way might be even better.
Anyway, to answer your question: https://ollama.ai/
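On Linux, getting started is basically two commands (taken from ollama's install instructions; the model tag here is just an example, and macOS/Windows use a downloadable app instead):

```
# Install ollama, then pull and chat with a model in one step
curl -fsSL https://ollama.ai/install.sh | sh
ollama run mistral
```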
I assume you mean IaC? It’s how you manage infrastructure with automation: Terraform, Ansible, CDK etc. Back in the day it used to be for managing pets with Puppet, Chef and friends.
I was talking about GPT-3, which I have interacted with via the developer API.
To be fair to the developers of Mistral, it's still amazing that I can run this on my computer, and it's certainly better than the first LLaMA.
The emojis themselves aren't a problem. I love silliness. They just seem to be correlated with rugpulls.
I haven't looked too deeply into the code, but it looks to me like it just uses OCI/Docker images as a distribution format.
You can find some evidence of that in the model directory (mentioned in the FAQ), where you can find manifest files that have contents of a docker manifest.
Using this format likely makes distribution a lot easier as they can just use an existing OCI registry for hosting the images, and you can use existing push/pull primitives. OCI images are often used nowadays for things besides docker images if you have a format (like a layered filesystem) that is nicely content-addressable.
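Roughly, such a manifest looks like this. This is a hand-written illustration of the OCI manifest shape, not copied from ollama's registry; the media types, sizes, and digests are approximate:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:...",
    "size": 412
  },
  "layers": [
    { "mediaType": "application/vnd.ollama.image.model",    "digest": "sha256:...", "size": 4109865159 },
    { "mediaType": "application/vnd.ollama.image.template", "digest": "sha256:...", "size": 136 },
    { "mediaType": "application/vnd.ollama.image.params",   "digest": "sha256:...", "size": 84 }
  ]
}
```

Because everything is content-addressed by digest, layers shared between models (e.g. a common template) only need to be stored and downloaded once.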
The default registry for Ollama is https://ollama.ai/library (I don't know about the API endpoint). I'm not sure if/how you can actually use it with another registry endpoint.
When people say that something "uses Docker" that usually involves utilizing the Docker daemon with its sandboxing, networking, etc. Ollama doesn't do or need to do that, as it can just do inference natively based on GGUF files (plus the other configuration that comes as part of their model file).
However, as you have noticed, it does borrow heavily from Docker on a conceptual level. E.g. apart from the registry mechanism its Modelfile heavily mimics a Dockerfile.
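For illustration, a minimal Modelfile might look like this (instruction names are from ollama's docs; the specific values are made-up examples):

```
# Modelfile: a base model plus a few overrides, Dockerfile-style
FROM mistral:7b-instruct
PARAMETER temperature 0.7
SYSTEM You are a concise assistant.
```

You'd then build it with something like `ollama create my-model -f Modelfile`, which mirrors `docker build` conceptually even though no container is involved.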
ollama maintains a registry of ollama compatible llms. I don't see any indication that they use docker for that registry, but I could have overlooked something. I also don't see any docker/podman containers running when using ollama.
Thanks for clarifying!
You must send them an email to get commercial use permission. It does not look like a lawyer wrote their license either
I think you have to do similar things with Baichuan too. Which is also a bit pseudo open source