abubakar

stormchaser

AI & ML interests

None yet

Recent Activity

new activity 3 days ago

nvidia/NitroGen:Slow/lagging

upvoted a collection about 1 month ago

Olmo 3

commented on an article about 1 month ago

Building for an Open Future - our new partnership with Google Cloud

Organizations

None yet

New activity in nvidia/NitroGen 3 days ago

Slow/lagging

#1 opened 4 days ago by

TrenchantWits

upvoted a collection about 1 month ago

Olmo 3

Collection

Artifacts for the Olmo 3 release. • 9 items • Updated 1 day ago • 156

commented on Building for an Open Future - our new partnership with Google Cloud about 1 month ago

This is all good for a lot of people. Just please don't get acquired by Google, that's all 🤗

New activity in KingNish/Realtime-whisper-large-v3-turbo 4 months ago

whister-real-time

#7 opened 5 months ago by

IKyzo

updated a model 7 months ago

stormchaser/Nemotron-Research-Reasoning-Qwen-1.5B-GGUF

2B • Updated Jun 3 • 14

published a model 7 months ago

stormchaser/Nemotron-Research-Reasoning-Qwen-1.5B-GGUF

2B • Updated Jun 3 • 14

upvoted an article 7 months ago

Article

nanoVLM: The simplest repository to train your VLM in pure PyTorch

May 21 • 244

upvoted an article 10 months ago

Article

Fixing Open LLM Leaderboard with Math-Verify

Feb 14

reacted to bartowski's post with 👀👍 about 1 year ago

Post

80521

Looks like Q4_0_N_M file types are going away

Before you panic, there's a new "preferred" method which is online (I prefer the term on-the-fly) repacking, so if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses I think due to using intrinsics instead of assembly, but intrinsics are more maintainable)

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speeds (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!

IQ4_NL also supports repacking, though not in as many shapes yet; it should still get a respectable speedup on ARM chips. The PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights

17 replies

updated a collection about 1 year ago

papers

Collection

3 items • Updated Dec 7, 2024

upvoted a paper about 1 year ago

Star Attention: Efficient LLM Inference over Long Sequences

Paper • 2411.17116 • Published Nov 26, 2024 • 53

commented on a paper about 1 year ago

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Paper • 2411.10640 • Published Nov 16, 2024 • 46

New activity in KingNish/Realtime-FLUX about 1 year ago

A beautiful girl and a young man. A date in a hotel. A bed.

🔥 2

#9 opened about 1 year ago by

Asanych

updated 2 collections over 1 year ago

papers

Collection

3 items • Updated Dec 7, 2024

spaces-fav

Collection

1 item • Updated Apr 24, 2024

updated a model over 1 year ago

stormchaser/llava-llama-3-8b-v1_1-Q6_K-GGUF

Image-Text-to-Text • 8B • Updated Apr 23, 2024 • 28

reacted to akhaliq's post with 👀 over 1 year ago

Post

4253

Mixture-of-Depths

Dynamically allocating compute in transformer-based language models

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)

Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
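
The top-k routing idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the names `route_top_k` and `mod_layer` are hypothetical, tokens are plain floats rather than vectors, and `block` stands in for the self-attention and MLP computation. The only behaviour taken from the abstract is that a fixed budget of k tokens per layer is processed; letting the remaining tokens pass through unchanged is an assumption of this sketch.

```python
from typing import Callable, List, Sequence


def route_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in sequence order."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the sequence length")
    # Rank positions by router score, highest first (ties keep sequence order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mod_layer(
    tokens: Sequence[float],
    scores: Sequence[float],
    k: int,
    block: Callable[[float], float],
) -> List[float]:
    """Apply `block` only to the k routed tokens.

    Assumption of this sketch: tokens that are not selected skip the block
    and are returned unchanged.
    """
    selected = set(route_top_k(scores, k))
    return [block(x) if i in selected else x for i, x in enumerate(tokens)]


# Toy usage: 4 tokens, a budget of k = 2, and a stand-in block that scales by 10.
tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.3, 0.7]  # hypothetical router outputs
out = mod_layer(tokens, scores, k=2, block=lambda x: x * 10)
print(out)  # [1.0, 20.0, 3.0, 40.0]
```

Because k is fixed ahead of time, the number of calls to `block` is the same for every input, even though which positions are selected changes with the scores.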

1 reply

upvoted a collection over 1 year ago

MGM

Collection

Official model collection for the paper "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models" • 13 items • Updated May 3, 2024 • 47
