Officer, thanks for your hard work! Engineers in the field, today's AI models have gotten seriously huge and it's causing you grief, hasn't it? People keep begging me to do something about the pressure on memory and compute resources. That's where "nunchaku-tech/ComfyUI-nunchaku", aka "Nunchaku", comes in!
From i-harness.com
Azure Model Quantizer:
You have three specialized Phi-3 students ready to go, but the GPU is full.
Here's what to do next.
#AzureAI #Quantization #VRAM #ModelCompression
youtu.be/bmSx4NCAgdU
The TrendForce note is interesting: if 6x cache reduction holds at scale, HBM demand projections for inference will need revising.
Follow @MorelMatth66161 for more threads like this.
#Quantization #LLMs #MLOps #AIEngineering #MachineLearning
The Azure AI Model Quantizer
Fitting a massive brain into a tiny footprint
#AzureAI #Quantization #VRAM #ModelCompression
youtu.be/bY2PEAsn0wc
Quantization from the Ground Up highlights how small, precise numeric steps can dramatically boost model efficiency and speed without sacrificing accuracy. Optimize compute, reduce memory, and scale smarter. #ML #Quantization #AIEngineering #DeepLearning
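For a concrete picture of those "small, precise numeric steps", here is a minimal, library-agnostic sketch of symmetric per-tensor INT8 quantization: a single float scale maps weights onto 255 integer levels, cutting storage 4x against FP32 while keeping the rounding error bounded by half a step.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0                       # one float step per integer level
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)  # 4x smaller
print("max abs error:", np.abs(w - w_hat).max())          # bounded by scale / 2
```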
@saen_dev 🚨 Breaking: TurboQuant in MLX achieves 6/6 exact match across 64K context with no accuracy loss, pure compression sorcery. 2.5-bit slashes KV cache 4.9x; 3.5-bit hits 3.8x. Models like Qwen3.5-35B-A3B now run locally on 13GB RAM. 🧠🔥 #Quantization #MLX #KVCacheOptimization
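Rough back-of-the-envelope math behind claims like "4.9x KV-cache reduction at 64K context": cache size scales as layers x KV heads x head dim x 2 (K and V) x sequence length x bits per element. The dimensions below are hypothetical placeholders, not Qwen3.5-35B-A3B's real config, and the headline ratio comes out lower than raw 16/2.5 because scales and zero-points add overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Total KV-cache size: K and V tensors across all layers."""
    elems = 2 * layers * kv_heads * head_dim * seq_len
    return elems * bits / 8

# Hypothetical config for illustration only (not the real model dims).
layers, kv_heads, head_dim, seq_len = 48, 8, 128, 64_000

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=16)
q25  = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=2.5)

print(f"fp16 cache : {fp16 / 2**30:.1f} GiB")
print(f"2.5-bit    : {q25 / 2**30:.1f} GiB  ({fp16 / q25:.1f}x smaller before metadata overhead)")
```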
Google's TurboQuant Cuts LLM Memory 6x With Zero Loss
Google Research's TurboQuant compresses LLM key-value cache by 6x and delivers 8x speedup on H100 GPUs with zero accuracy loss - no fine-tuning required.
#Google #Quantization
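This is not TurboQuant's actual algorithm, just a generic illustration of what a KV-cache quantizer does: each cached K/V row gets quantized to a few bits with a per-token scale and zero-point before storage, then dequantized at attention time.

```python
import numpy as np

def quantize_kv_4bit(kv: np.ndarray):
    """Per-token (per-row) asymmetric 4-bit quantization of a K or V slice.

    kv: [num_tokens, head_dim] float32. Returns codes plus the per-row
    scale/zero-point needed to reconstruct values at attention time.
    """
    lo = kv.min(axis=1, keepdims=True)
    hi = kv.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                          # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)          # guard constant rows
    codes = np.clip(np.round((kv - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

k = np.random.randn(1024, 128).astype(np.float32)     # 1024 cached tokens, head_dim 128
codes, scale, lo = quantize_kv_4bit(k)
k_hat = dequantize_kv(codes, scale, lo)
print("mean abs reconstruction error:", np.abs(k - k_hat).mean())
```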

Attaching a bottleneck screenshot. Looking for ideas to remove/redistribute the accumulation barrier or reduce reg/SRAM pressure without reintroducing bank conflicts — any patterns or tricks that worked for you?
#CUDA #Quantization #NsightCompute #GPUEngineering #FlashAttention
Continuing from my last article: basically, Jang_2L delivers the same code quality as larger models, with faster caching, faster token generation, and a much smaller footprint. Great job @dealignai ! #quantization #LocalLLM
DynaMo: runtime bit-width switching for MoE. No retraining.
arXiv 2503.21135: channel-level adaptation. Works on Qwen3-MoE and Mistral Small 4 without accuracy loss.
Practical for local inference today.
arxiv.org/abs/2503.21135
#MachineLearning #Quantization #LocalLLM #LLMs
DynaMo: Runtime Switchable Quantization for MoE with Cross-Dataset...
As the Mix-of-Experts (MoE) architecture increases the number of parameters in large models, there is an even greater need for model quantization. However, existing quantization methods overlook...
From arxiv.org
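The sketch below is not DynaMo's channel-level method, only a toy illustration of the runtime-switching idea: keep pre-quantized INT8 and INT4 copies of an expert's weights and pick one per forward call, with no retraining involved.

```python
import numpy as np

class SwitchableExpert:
    """Toy MoE expert holding INT8 and INT4 copies of one weight matrix.

    Illustrates runtime bit-width switching in spirit only; DynaMo itself
    adapts precision per channel rather than per whole expert.
    """
    def __init__(self, w: np.ndarray):
        self.variants = {bits: self._quantize(w, bits) for bits in (8, 4)}

    @staticmethod
    def _quantize(w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    def forward(self, x: np.ndarray, bits: int) -> np.ndarray:
        q, scale = self.variants[bits]            # switch precision at call time
        return x @ (q.astype(np.float32) * scale)

expert = SwitchableExpert(np.random.randn(256, 256).astype(np.float32))
x = np.random.randn(1, 256).astype(np.float32)
y_hi = expert.forward(x, bits=8)   # accuracy-critical request
y_lo = expert.forward(x, bits=4)   # latency/memory-critical request
print("mean |y8 - y4|:", np.abs(y_hi - y_lo).mean())
```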
The model repositories are below.
INT8: huggingface.co/mssfj/Qwen3.5-…
INT4: huggingface.co/mssfj/Qwen3.5-…
#Qwen3.5 #Quantization
mssfj/Qwen3.5-9B-GPTQ-INT4 · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
From huggingface.co
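For context, repos like these are typically produced with the Hugging Face GPTQ integration along the lines below; the base checkpoint, calibration set, and group size are assumptions, since the post doesn't state them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen3.5-9B"   # assumed base checkpoint; the post doesn't name it

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration dataset and group size are illustrative defaults, not the
# settings actually used for the mssfj/Qwen3.5-9B-GPTQ-* repos.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,   # quantizes layer by layer during load
    device_map="auto",
)
model.save_pretrained("Qwen3.5-9B-GPTQ-INT4")
tokenizer.save_pretrained("Qwen3.5-9B-GPTQ-INT4")
```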
ParoQuant (ICLR 2026): pairwise Givens rotations suppress PTQ outliers in reasoning LLMs.
Reasoning models hit quant artifacts harder than standard LLMs. This targets that.
No retraining.
#AI #Quantization #LocalLLM #AIEngineering #ModelEvaluation
arxiv.org/abs/2511.10645
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning...
Post-training quantization (PTQ) compresses the weights and activations of large language models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference....
From arxiv.org
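Not the paper's algorithm, but a tiny illustration of why pairwise rotations help PTQ: a Givens rotation over two channels spreads a single outlier across both, shrinking the max magnitude the quantizer's scale has to cover, and the rotation is invertible so no information is lost.

```python
import numpy as np

def givens(theta: float) -> np.ndarray:
    """2x2 Givens rotation acting on one pair of channels."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]], dtype=np.float32)

# Two weight channels; one has a large outlier that dominates the quant scale.
w = np.array([[10.0, 0.1,  0.2, -0.1],
              [ 0.3, 0.2, -0.2,  0.1]], dtype=np.float32)

R = givens(np.pi / 4)          # 45-degree rotation splits energy between the pair
w_rot = R @ w                  # invertible: fold R.T back in at inference time

print("max |w| before:", np.abs(w).max())      # ~10.0 -> coarse quant steps
print("max |w| after :", np.abs(w_rot).max())  # ~7.3  -> finer quant steps
```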
I started a new project: lowbit-math-reasoning
Initial setup:
- Qwen3.5-9B quantized with GPTQ (8-bit / 4-bit)
- mssfj/Qwen3.5-9B-GPTQ-INT8 / -INT4
Next: GSM8K/MATH/HLE(MATH) evaluation to measure reasoning collapse under lowbit constraints.
#LLM #Quantization #MathReasoning
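A minimal sketch of the evaluation harness such a project needs, assuming the INT4 repo loads through vanilla transformers; the prompt template and answer extraction below are simplifications, not the project's actual pipeline.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mssfj/Qwen3.5-9B-GPTQ-INT4"   # repo named in the post

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def solve(question: str) -> str:
    """Greedy generation with a GSM8K-style 'final number after ####' convention."""
    prompt = f"Question: {question}\nAnswer step by step, then give the final number after ####.\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else ""

# One GSM8K-style item as a smoke test; a real run would loop over the dataset
# and compare INT8 vs INT4 accuracy to measure reasoning collapse.
print(solve("A bag has 3 red and 5 blue marbles. How many marbles are in the bag?"))
```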
Huge news in AI efficiency! A researcher just trained 4-bit CNNs from scratch on a CPU, achieving near FP32 accuracy. This could revolutionize deployment for tiny devices. #AI #DeepLearning #Quantization #EdgeAI
blog.codesacure.com/4-bit-quantiza…
4-Bit Quantization: A Breakthrough for Efficient AI
In the rapidly evolving landscape of artificial intelligence, the quest for more efficient and less resource-intensive models is paramount. Deep learning models, while incredibly powerful, often...
From blog.codesacure.com
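The article's exact recipe isn't reproduced here; the snippet below shows the standard trick that makes training low-bit CNNs from scratch possible: fake-quantize weights to 4-bit levels in the forward pass while letting gradients flow through unchanged (the straight-through estimator).

```python
import torch
import torch.nn as nn

class FakeQuant4bit(torch.autograd.Function):
    """Round weights to 15 symmetric 4-bit levels; pass gradients straight through."""
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 7.0                      # 4-bit signed range [-7, 7]
        return torch.clamp(torch.round(w / scale), -7, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                               # straight-through estimator

class QuantConv2d(nn.Conv2d):
    """Conv layer whose weights are fake-quantized to 4 bits on every forward pass."""
    def forward(self, x):
        w_q = FakeQuant4bit.apply(self.weight)
        return self._conv_forward(x, w_q, self.bias)

# Tiny CNN trained on CPU with 4-bit weights, mirroring the article's setting.
model = nn.Sequential(QuantConv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                                          # gradients reach the fp32 weights via STE
opt.step()
```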