Projects with this topic
Intelligent VRAM/RAM swapping for LLM inference - Extension of KVortex
Automated LLM Benchmarking on GPU - tokens/sec, latency percentiles, VRAM profiling, multi-format support (HuggingFace, GGUF, GPTQ)
VRAM to RAM Offloader for AI and vLLM - High-Performance C++23 KV Cache Engine with Multi-Stream GPU Transfers
Extreme KV Cache Compression for LLM Inference — C++17/CUDA implementation of TurboQuant (arXiv 2504.19874). 7.5x compression, <2% quality loss.
Hive is a peer-to-peer system that distributes AI inference tasks across volunteer workers ("bees") running local or cloud LLMs. Send the same task to multiple bees in parallel, then automatically merge their outputs over several rounds to make small models smarter together. Built with Rust, Tauri, and libp2p.
In this project, we discard the hypothesis of a discrete distribution for word probabilities and look for the proper correction to exchangeability in order to better attribute books to authors.
Evaluation of the inference times of Fast R-CNN, Faster R-CNN, and Mask R-CNN