Officer, thanks for your hard work! Engineers in the field, today's AI models have gotten seriously huge and it's causing you grief, hasn't it? People keep begging me to do something about the pressure on memory and compute resources. That's where "nunchaku-tech/ComfyUI-nunchaku", aka "Nunchaku", comes in!
From i-harness.com
Azure Model Quantizer:
You have three specialized Phi-3 students ready to go, but the GPU is full.
Here's what to do next.
#AzureAI #Quantization #VRAM #ModelCompression
youtu.be/bmSx4NCAgdU
The TrendForce note is interesting: if 6x cache reduction holds at scale, HBM demand projections for inference will need revising.
Follow @MorelMatth66161 for more threads like this.
#Quantization #LLMs #MLOps #AIEngineering #MachineLearning
The Azure AI Model Quantizer
Fitting a massive brain into a tiny footprint
#AzureAI #Quantization #VRAM #ModelCompression
youtu.be/bY2PEAsn0wc
Quantization from the Ground Up highlights how small, precise numeric steps can dramatically boost model efficiency and speed without sacrificing accuracy. Optimize compute, reduce memory, and scale smarter. #ML #Quantization #AIEngineering #DeepLearning
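For a concrete picture of those "small, precise numeric steps", here is a minimal, library-agnostic sketch of symmetric per-tensor INT8 quantization: a single float scale maps weights onto 255 integer levels, cutting storage 4x against FP32 while keeping the rounding error bounded by half a step.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0                       # one float step per integer level
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)  # 4x smaller
print("max abs error:", np.abs(w - w_hat).max())          # bounded by scale / 2
```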
@saen_dev 🚨 Breaking: TurboQuant in MLX achieves 6/6 exact match across 64K context with no accuracy loss, pure compression sorcery. 2.5-bit slashes KV cache 4.9x; 3.5-bit hits 3.8x. Models like Qwen3.5-35B-A3B now run locally on 13GB RAM. 🧠🔥 #Quantization #MLX #KVCacheOptimization
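Rough back-of-the-envelope math behind claims like "4.9x KV-cache reduction at 64K context": cache size scales as layers x KV heads x head dim x 2 (K and V) x sequence length x bits per element. The dimensions below are hypothetical placeholders, not Qwen3.5-35B-A3B's real config, and the headline ratio comes out lower than raw 16/2.5 because scales and zero-points add overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Total KV-cache size: K and V tensors across all layers."""
    elems = 2 * layers * kv_heads * head_dim * seq_len
    return elems * bits / 8

# Hypothetical config for illustration only (not the real model dims).
layers, kv_heads, head_dim, seq_len = 48, 8, 128, 64_000

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=16)
q25  = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits=2.5)

print(f"fp16 cache : {fp16 / 2**30:.1f} GiB")
print(f"2.5-bit    : {q25 / 2**30:.1f} GiB  ({fp16 / q25:.1f}x smaller before metadata overhead)")
```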
Google's TurboQuant Cuts LLM Memory 6x With Zero Loss
Google Research's TurboQuant compresses LLM key-value cache by 6x and delivers 8x speedup on H100 GPUs with zero accuracy loss - no fine-tuning required.
#Google #Quantization
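This is not TurboQuant's actual algorithm, just a generic illustration of what a KV-cache quantizer does: each cached K/V row gets quantized to a few bits with a per-token scale and zero-point before storage, then dequantized at attention time.

```python
import numpy as np

def quantize_kv_4bit(kv: np.ndarray):
    """Per-token (per-row) asymmetric 4-bit quantization of a K or V slice.

    kv: [num_tokens, head_dim] float32. Returns codes plus the per-row
    scale/zero-point needed to reconstruct values at attention time.
    """
    lo = kv.min(axis=1, keepdims=True)
    hi = kv.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                          # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)          # guard constant rows
    codes = np.clip(np.round((kv - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

k = np.random.randn(1024, 128).astype(np.float32)     # 1024 cached tokens, head_dim 128
codes, scale, lo = quantize_kv_4bit(k)
k_hat = dequantize_kv(codes, scale, lo)
print("mean abs reconstruction error:", np.abs(k - k_hat).mean())
```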

Attaching a bottleneck screenshot. Looking for ideas to remove/redistribute the accumulation barrier or reduce reg/SRAM pressure without reintroducing bank conflicts — any patterns or tricks that worked for you?
#CUDA #Quantization #NsightCompute #GPUEngineering #FlashAttention
Continuing from my last article: basically, Jang_2L delivers the same code quality as larger models, with faster caching, faster token generation, and a much smaller footprint. Great job @dealignai ! #quantization #LocalLLM
DynaMo: runtime bit-width switching for MoE. No retraining.
arXiv 2503.21135: channel-level adaptation. Works on Qwen3-MoE and Mistral Small 4 without accuracy loss.
Practical for local inference today.
arxiv.org/abs/2503.21135
#MachineLearning #Quantization #LocalLLM #LLMs
DynaMo: Runtime Switchable Quantization for MoE with Cross-Dataset...
As the Mix-of-Experts (MoE) architecture increases the number of parameters in large models, there is an even greater need for model quantization. However, existing quantization methods overlook...
From arxiv.org
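The sketch below is not DynaMo's channel-level method, only a toy illustration of the runtime-switching idea: keep pre-quantized INT8 and INT4 copies of an expert's weights and pick one per forward call, with no retraining involved.

```python
import numpy as np

class SwitchableExpert:
    """Toy MoE expert holding INT8 and INT4 copies of one weight matrix.

    Illustrates runtime bit-width switching in spirit only; DynaMo itself
    adapts precision per channel rather than per whole expert.
    """
    def __init__(self, w: np.ndarray):
        self.variants = {bits: self._quantize(w, bits) for bits in (8, 4)}

    @staticmethod
    def _quantize(w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    def forward(self, x: np.ndarray, bits: int) -> np.ndarray:
        q, scale = self.variants[bits]            # switch precision at call time
        return x @ (q.astype(np.float32) * scale)

expert = SwitchableExpert(np.random.randn(256, 256).astype(np.float32))
x = np.random.randn(1, 256).astype(np.float32)
y_hi = expert.forward(x, bits=8)   # accuracy-critical request
y_lo = expert.forward(x, bits=4)   # latency/memory-critical request
print("mean |y8 - y4|:", np.abs(y_hi - y_lo).mean())
```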
The model repositories are below.
INT8: huggingface.co/mssfj/Qwen3.5-…
INT4: huggingface.co/mssfj/Qwen3.5-…
#Qwen3.5 #Quantization
mssfj/Qwen3.5-9B-GPTQ-INT4 · Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
From huggingface.co
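For context, repos like these are typically produced with the Hugging Face GPTQ integration along the lines below; the base checkpoint, calibration set, and group size are assumptions, since the post doesn't state them.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen3.5-9B"   # assumed base checkpoint; the post doesn't name it

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration dataset and group size are illustrative defaults, not the
# settings actually used for the mssfj/Qwen3.5-9B-GPTQ-* repos.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,   # quantizes layer by layer during load
    device_map="auto",
)
model.save_pretrained("Qwen3.5-9B-GPTQ-INT4")
tokenizer.save_pretrained("Qwen3.5-9B-GPTQ-INT4")
```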
ParoQuant (ICLR 2026): pairwise Givens rotations suppress PTQ outliers in reasoning LLMs.
Reasoning models hit quant artifacts harder than standard LLMs. This targets that.
No retraining.
#AI #Quantization #LocalLLM #AIEngineering #ModelEvaluation
arxiv.org/abs/2511.10645
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning...
Post-training quantization (PTQ) compresses the weights and activations of large language models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference....
From arxiv.org
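Not the paper's algorithm, but a tiny illustration of why pairwise rotations help PTQ: a Givens rotation over two channels spreads a single outlier across both, shrinking the max magnitude the quantizer's scale has to cover, and the rotation is invertible so no information is lost.

```python
import numpy as np

def givens(theta: float) -> np.ndarray:
    """2x2 Givens rotation acting on one pair of channels."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]], dtype=np.float32)

# Two weight channels; one has a large outlier that dominates the quant scale.
w = np.array([[10.0, 0.1,  0.2, -0.1],
              [ 0.3, 0.2, -0.2,  0.1]], dtype=np.float32)

R = givens(np.pi / 4)          # 45-degree rotation splits energy between the pair
w_rot = R @ w                  # invertible: fold R.T back in at inference time

print("max |w| before:", np.abs(w).max())      # ~10.0 -> coarse quant steps
print("max |w| after :", np.abs(w_rot).max())  # ~7.3  -> finer quant steps
```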
I started a new project: lowbit-math-reasoning
Initial setup:
- Qwen3.5-9B quantized with GPTQ (8-bit / 4-bit)
- mssfj/Qwen3.5-9B-GPTQ-INT8 / -INT4
Next: GSM8K/MATH/HLE(MATH) evaluation to measure reasoning collapse under lowbit constraints.
#LLM #Quantization #MathReasoning
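A minimal sketch of the evaluation harness such a project needs, assuming the INT4 repo loads through vanilla transformers; the prompt template and answer extraction below are simplifications, not the project's actual pipeline.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mssfj/Qwen3.5-9B-GPTQ-INT4"   # repo named in the post

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def solve(question: str) -> str:
    """Greedy generation with a GSM8K-style 'final number after ####' convention."""
    prompt = f"Question: {question}\nAnswer step by step, then give the final number after ####.\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else ""

# One GSM8K-style item as a smoke test; a real run would loop over the dataset
# and compare INT8 vs INT4 accuracy to measure reasoning collapse.
print(solve("A bag has 3 red and 5 blue marbles. How many marbles are in the bag?"))
```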
Huge news in AI efficiency! A researcher just trained 4-bit CNNs from scratch on a CPU, achieving near FP32 accuracy. This could revolutionize deployment for tiny devices. #AI #DeepLearning #Quantization #EdgeAI
blog.codesacure.com/4-bit-quantiza…
4-Bit Quantization: A Breakthrough for Efficient AI
In the rapidly evolving landscape of artificial intelligence, the quest for more efficient and less resource-intensive models is paramount. Deep learning models, while incredibly powerful, often...
From blog.codesacure.com
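The article's exact recipe isn't reproduced here; the snippet below shows the standard trick that makes training low-bit CNNs from scratch possible: fake-quantize weights to 4-bit levels in the forward pass while letting gradients flow through unchanged (the straight-through estimator).

```python
import torch
import torch.nn as nn

class FakeQuant4bit(torch.autograd.Function):
    """Round weights to 15 symmetric 4-bit levels; pass gradients straight through."""
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 7.0                      # 4-bit signed range [-7, 7]
        return torch.clamp(torch.round(w / scale), -7, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                               # straight-through estimator

class QuantConv2d(nn.Conv2d):
    """Conv layer whose weights are fake-quantized to 4 bits on every forward pass."""
    def forward(self, x):
        w_q = FakeQuant4bit.apply(self.weight)
        return self._conv_forward(x, w_q, self.bias)

# Tiny CNN trained on CPU with 4-bit weights, mirroring the article's setting.
model = nn.Sequential(QuantConv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                                          # gradients reach the fp32 weights via STE
opt.step()
```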