A few more variants first

There are a few unexplored elements of rose5 that I need to dig into now that I've taken stock of the full roster.

As it stands, the majority of the consistency comes directly from cosine on the hypersphere, using the pentas as a variant form of vector lattice. It works, yes, but it's also not what the models are supposed to be doing.

If left running too long, the variants show that the cosine primarily collapses to being head-dependent, which means the model eventually just falls into the classification state.

It seems only one loss is necessary, and that one loss is a combination of rose (multi cosine) mixed with alignment and margin losses.
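
To pin down the shape of that single loss, here's a minimal sketch of what a rose + alignment + margin objective could look like, assuming each class is represented by the 5 vertices of its pentachoron stored as a (classes, 5, dim) tensor. The names and weights are illustrative only, not the actual training code.

```python
import torch
import torch.nn.functional as F

def rose_alignment_margin_loss(feats, vertices, labels,
                               margin=0.2, w_align=1.0, w_margin=1.0):
    """Illustrative combined objective: rose (multi-cosine) similarity to each
    class's 5 pentachoron vertices, an alignment term pulling features toward
    their own class, and a margin term pushing away the best competing class.
    Shapes: feats (B, D), vertices (C, 5, D), labels (B,)."""
    feats = F.normalize(feats, dim=-1)               # features on the unit hypersphere
    verts = F.normalize(vertices, dim=-1)            # vertex directions, (C, 5, D)

    # rose score: mean cosine similarity over the 5 vertices of each class
    cos = torch.einsum('bd,cvd->bcv', feats, verts)  # (B, C, 5)
    rose = cos.mean(dim=-1)                          # (B, C)

    pos = rose.gather(1, labels[:, None]).squeeze(1)               # true-class similarity
    mask = F.one_hot(labels, rose.size(1)).bool()
    neg = rose.masked_fill(mask, float('-inf')).max(dim=1).values  # hardest other class

    align = (1.0 - pos).mean()                # pull toward the own pentachoron
    marg = F.relu(neg - pos + margin).mean()  # keep a margin over rivals
    return w_align * align + w_margin * marg

# illustrative usage with random tensors (100 classes, 128-dim features)
loss = rose_alignment_margin_loss(torch.randn(8, 128),
                                  torch.randn(100, 5, 128),
                                  torch.randint(0, 100, (8,)))
```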

If my hunch is correct, centroid may be much weaker than rose5 with all the losses except geometric rose turned off.

In any case, the model's objective needs to be restructured based on how this data responded to the variants. Primarily, the geometry and the patches need to be directly interdependent. The geometric pentachora need to learn and act as an embedding gateway for additional information, as the frozen variants have shown they only provide meaning up to a certain point - and then that meaning is exhausted for small models. The meaning may be better than placebo, but not by much. In fact, when cross-entropy is introduced at high weights, the meaning collapses into hindrance and the model simply learns to ignore the pentas, which is definitely not the goal.

The next version of this vit model will have gradient updates enabled, with learning specifically allocated to the pentas across a few test methodologies. Some standard, some less standard and more geometric, some completely bananas that probably won't work but I want to try them anyway.

I have quite a few formulas I need to try out, so as one chapter closes the next opens up - direct penta + patch learning using geometric classification only, with feature analysis afterward rather than classification analysis.

This will enable a much larger pool of pentachora, as it's going to build on a similar system to bert to create a similar abstraction to clip-vit, with some variant changes and requirements that the geometry needs.

Surgery again

Alright. This initial cycle is concluded. I've determined that the frozen pentachora are in fact utilizable at about 50% most of the time and will eventually cap at 60% if you stick to standard cross-entropy with tons of geometric regularization. So roughly 3/5ths of the pentas are covered and the other two are completely discarded.

I've begun analyzing feature sets for the variants at this point. The outputs are very promising in terms of potential, even if the methods used to calculate accuracy were... well, suboptimal.

I've learned that cross-entropy + geometry is a definite no-go. The pentas exist below zero, and their formulas are built to revolve around a dynamic position - so below-zero logits being the absolute norm for classification is definitely... not something that cross-entropy agrees with.

In any case, determining feature analysis tools is crucial today. I need to figure out exactly what's in them to properly calculate a variant that can actually be analyzed correctly, because these losses aren't cutting it.

I've defaulted to a singular global entropy, a simplex loss, a geometry loss, and a few other potentials that can be used to train the clip-vit variant - but the bucketing itself - the 100 pentas - needs to be correctly assessed and calculated before a larger variant can be created.
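
As one illustrative stand-in for a simplex/geometry-style term (my own placeholder, not the exact formula I'm training with), a regularizer that pushes each of the 100 pentas toward a regular 4-simplex by equalizing its edge lengths:

```python
import torch

def simplex_regularity_loss(pentas):
    """Push each pentachoron toward a regular 4-simplex by penalizing the
    variance of its 10 edge lengths. pentas: (C, 5, D)."""
    d = torch.cdist(pentas, pentas)            # (C, 5, 5) pairwise vertex distances
    i, j = torch.triu_indices(5, 5, offset=1)  # the 10 unique edges
    edges = d[:, i, j]                         # (C, 10)
    return edges.var(dim=1).mean()

# e.g. a small weight on this keeps the 100 bucketed pentas from collapsing
reg = simplex_regularity_loss(torch.randn(100, 5, 128))
```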

There are more than enough weights; I just need to smash math today to see if I can project a proper lattice that conforms to the CM + Graham principles without restructuring the entire set of weights.
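
For checking whether a projected lattice actually conforms, assuming CM here refers to the Cayley-Menger determinant, a small diagnostic that reports whether a pentachoron still spans a non-degenerate 4-simplex:

```python
import math
import torch

def cayley_menger_volume_sq(vertices):
    """Squared volume of the simplex spanned by `vertices` ((k+1, D) tensor)
    via the Cayley-Menger determinant. For a pentachoron pass 5 points; a
    result near zero means the simplex has collapsed (degenerate lattice)."""
    k = vertices.shape[0] - 1
    d2 = torch.cdist(vertices, vertices) ** 2                     # squared pairwise distances
    m = torch.ones(k + 2, k + 2, dtype=vertices.dtype, device=vertices.device)
    m[0, 0] = 0.0
    m[1:, 1:] = d2
    # V_k^2 = (-1)^(k+1) / (2^k * (k!)^2) * det(M)
    return (-1.0) ** (k + 1) / (2 ** k * math.factorial(k) ** 2) * torch.linalg.det(m)

# check one pentachoron out of the bucketed set
penta = torch.randn(5, 128, dtype=torch.float64)
print(cayley_menger_volume_sq(penta))    # > 0 for a healthy, non-degenerate 4-simplex
```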

It's possible that the accuracy is much higher than expected and that I'm simply not asking them the right question. That would be quite the thing, wouldn't it?

Instead of the 5-10 losses I'm currently experimenting with, I would think 1-3 would be the correct amount. The current losses often conflict, so they must be normalized, and a proper optimizer with EMA must be constructed.
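
A rough picture of what "normalize the conflicting losses and add EMA" could mean in practice; all names and coefficients here are placeholders, not the eventual optimizer:

```python
import copy
import torch

class LossNormalizer:
    """Scale each loss by a running mean of its own magnitude so that
    conflicting terms contribute on a comparable scale."""
    def __init__(self, names, momentum=0.99):
        self.avg = {n: 1.0 for n in names}
        self.momentum = momentum

    def __call__(self, losses):  # losses: dict of name -> scalar tensor
        total = 0.0
        for name, value in losses.items():
            self.avg[name] = (self.momentum * self.avg[name]
                              + (1 - self.momentum) * float(value.detach()))
            total = total + value / (self.avg[name] + 1e-8)
        return total

class ModelEMA:
    """Exponential moving average of model parameters, kept for evaluation."""
    def __init__(self, model, decay=0.999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for e, p in zip(self.ema.parameters(), model.parameters()):
            e.lerp_(p, 1.0 - self.decay)   # e = decay * e + (1 - decay) * p

# usage sketch: norm = LossNormalizer(['rose', 'align', 'margin'])
#               total = norm({'rose': l_rose, 'align': l_align, 'margin': l_marg})
```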

Breakdown and assessment

Using the standard vit position tokenization doesn't work. It caps our unique feature map representation to 65 tokens based on patch4, and because the high-dimensional geometry bloats dimensions to such a degree, the representative patches can only be learned to a certain degree of accuracy before they simply run out of space and the model defaults to memorizing the training data.
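
For reference, the 65 is just the patch arithmetic on CIFAR-sized inputs; a quick sanity check, plus a peek at what patch16 implies for larger inputs:

```python
def vit_token_count(image_size, patch_size, cls_token=True):
    """Number of position-encoded tokens a standard ViT sequence carries."""
    patches = (image_size // patch_size) ** 2
    return patches + int(cls_token)

print(vit_token_count(32, 4))     # 8*8 patches + 1 CLS = 65 tokens (cifar, patch4)
print(vit_token_count(256, 16))   # 16*16 patches + 1 CLS = 257 tokens (imagenet 256, patch16)
```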

Probe diagnostics show the following: [probe diagnostics figure]

They are in fact forming unique, separated clusters of similar assessment, and the 2D graph shows they are highly diverse: [2D cluster plot]

So the process works, but it simply does not have enough compartmentalization to fully represent a classified patch within a classification under the current set of parameters.

I am currently devising a better and more directly intertwined structure that represents the baseline geometry directly entangled with the vit patches, rather than trying to shape it indirectly through another system.

This direct representation will be a bit more volatile at first but it should help solidify the missing 40% accuracy in a utilizable and repeatable way to extract the necessary patches.

After a big notebook refactor

I have pushed the updated model code and included the loader. I will not include the losses or the training methodology until the full process is prepared and the paper is published. After that, you will see exactly what I've developed and why each piece exists. Until then there are only breadcrumbs and inference code.

I released a new version of eval with the new version of the model code.

  • Model load/save code has been streamlined, so it should correctly include the variant information in each checkpoint now (a rough sketch of that pattern follows this list).
  • Fixed multiple formula quirks that were contributing to invalid values and incorrect truths, and ultimately to negation.
  • Corrected cascading errors from zero, caused by silent, unseen internal model deviance, with careful entropy usage.
  • Fixed faulty contributions from the multiple highly-responsible losses required to sustain complexity while introducing variance.
  • Reintegrated the cutmix, which had been omitted due to instability with the earlier variant.
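
The idea behind the streamlined load/save is simply to bundle the variant/config metadata into the checkpoint so eval can rebuild the right model. A minimal sketch of that pattern, with placeholder field names (not the actual repo code):

```python
import torch

def save_checkpoint(model, config, path):
    """Bundle weights with the variant/config metadata they were trained under."""
    torch.save({
        'state_dict': model.state_dict(),
        'config': config,   # e.g. {'variant': 'vit_zana_nano', 'dim': 128, ...}
    }, path)

def load_checkpoint(path, build_model):
    """Rebuild the exact variant from the stored config, then load the weights."""
    ckpt = torch.load(path, map_location='cpu')
    model = build_model(**ckpt['config'])
    model.load_state_dict(ckpt['state_dict'])
    return model, ckpt['config']
```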

Tech debt smashed.

Okay, next up: the last system's variant appeared to be capped at around 55% no matter the size. Even with the correct formulas this still may not be sufficient. More than likely the entire feature will need to be reimagined, the patch size altered to 16, and the full imagenet 256 variant trained.

First though, the small one has to be cohesive enough.

Note

This is custom code to load/save the models. Be sure to always review custom code from any source before running it in a project.

Likely reintroduce the theta head tomorrow

The theta trains were actually not that bad. The head added some overhead, but not that much, and the outcome improved, so it's worth exploring more.

Currently the l1 trains are performing well but still not up to the 85% that I'm aiming for. Today's trains were underwhelming but enlightening. Deeper models aren't helpful in this structure any more than wide models are.

Reintroducing theta with some of my diffusion techniques might be in order if I can't get this one to comply. I'll try a couple of projection tricks before I start digging into other experiments, but as it stands this one isn't yielding. Training an expert on pure data isn't exactly the geometry's forte - training students tends to yield much more effectively in comparison - but I'm trying to make a teacher model that can actually train students with geometry here.

To be fair, if it doesn't work, there are plenty of alternative options in the vit realm already - but I have high confidence that I can make it work. I just need to read more about capturing images and treating the pentachora more as observers rather than as direct relational interaction toolkits.

It'll likely need the constellation, but we'll see. It probably needs David's expert system.

Still about the same accuracy deep as shallow

It's not a capacity issue YET, then, since shaper should have covered that.

I have a few ideas, but I think I'll focus on getting more shallow models stable and then scale up slowly instead of trying to just use a logarithm.

Formulas have been purified

The newest vit_zana_nano train has shown a very clean curve (runs/vit_zana_nano/20250913_192119). It's about a 3 MB model capable of producing 128-dim features at >50% accuracy on cifar100 classification.

This clean curve means the process is stable enough to introduce a larger set of depth and blocks without destroying the internals, while simultaneously enforcing the 5 loss formulas specifically curated for the pentachora math.

I've begun training a much deeper zana dubbed vit_zana_shaper. This model is 32 blocks deep with an MLP ratio of 1 and 2 attention heads, resting at about 3.5 million params or so.
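
For a ballpark on where the ~3.5M figure lands, here's a rough parameter estimate for a plain pre-norm transformer stack with those settings, assuming an embedding width of 128 (the width the nano's features use); the geometric pieces will shift the exact count.

```python
def vit_param_estimate(dim=128, depth=32, mlp_ratio=1.0,
                       patch=4, image=32, channels=3, classes=100):
    """Rough parameter count for a plain pre-norm ViT with these settings.
    Head count doesn't change the total for standard multi-head attention."""
    attn = 4 * dim * dim + 4 * dim            # qkv + output projection (+ biases)
    hidden = int(dim * mlp_ratio)
    mlp = 2 * dim * hidden + dim + hidden     # two linear layers (+ biases)
    norms = 4 * dim                           # two LayerNorms per block
    block = attn + mlp + norms

    tokens = (image // patch) ** 2 + 1        # patches + CLS
    embed = channels * patch * patch * dim + dim
    pos = tokens * dim
    head = dim * classes + classes

    return depth * block + embed + pos + head

print(vit_param_estimate())   # ~3.2M -- same ballpark as the quoted ~3.5M
```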

Let's see how she fares.

I've had an epiphany. We don't NEED transformer layers in their current form.

David's architecture already solved this need with high-efficiency multi-stage geometric mathematics.

David's classification structure houses a series of dimensional projection sub-systems tasked with learning mastery based on each pentachoron structure.

Each of those 5d representations ends up learning thousands of representative features. David is already capable of feature generation, just not robust enough to fully manifest an enriched ViT-grade dimensional feature... yet.

David's architecture can handle ImageNet's classifier count and features, leveraging 1000 classes with ease, while fitting on a floppy disk at over 70% accuracy, because David sees Clip-Vit-Base-Patch16 features.

I believe I've figured out a way to fundamentally represent those features in a meaningful way that can replace transformer layers in their current methodology with a different form of feedforward trajectory, edge, point, deviation, jitter, helix, theta, and similarity assessment - one that should house the information needed to teach the experts how to behave like David did.

This should allow the much larger networks to retain mathematical precision, learn the features in a different form of patch than is currently expected to be a patch, and create legitimate high-density geometric features.

Better rope incoming with actual meaningful learning

The last one wasn't meaningfully learning representations; the next should be more correctly curated and inferenced to impact the representative outcome. It should be a bit more accurate than the last, but no guarantees.

I once again let AI handle it for me and now I'll need to go micromanage again. This is on me again; you'd think I would learn. Oftentimes they can handle these sorts of tasks, and other times... well, other times they just kind of hook shit together and say it works, then it spins in circles.

Time for my favorite thing, research papers.

It's starting to look more like a glorified branched FFN than an MLP, so that's a thing I suppose.

Theta rope seems less accurate than the penta head

There's an experimental classification theta rotation head with multiple pentachora routes. So far the results are less accurate overall than similarity through rose without it. Experiments ongoing.

I assumed full control from the AIs and built it correctly.

I was relying too much on the AI and it made me slow. Today I assumed full control and built the models correctly. The architecture is cleaner and all three python files were uploaded for the v3 setup.

vit_zana_small is already seeing 50% by epoch 50, which is a big step up from the earlier pixies hard-locked at 41%.

Zana, the current version, is quite small and quite fast

At about 500k params, the zana_nano competes with its big sister pixie at a superior accuracy rating AND produces image features.

Running the system with refined wordnet tokens rather than full unicode made all the difference. The findings show that meaningful semantics matter a whole lot.

  • unicode: 21%
  • wordnet_eng (same model): >42%

All losses were modified heavily; the originals did not work at all with this structure.

V3 incoming.

Pushing HEAVILY into losses based on the WORKING high-entropy, high-learning-rate classification heads and forcing this thing into cohesion INSTANTLY.

That's the play. No more 200 epochs. These things should be ready in 10-20 epochs at most, and they should hit 80%+ accuracy, or they fail. Those are the two potentials here.

With correct logit and probe assessment, the substructure should be a profoundly more efficient and easily analyzable series of charts based on similarity for assessment and capability. None of this guesswork based on "what works with other models" - we KNOW what works, and I should never have second-guessed the formulas.

I have implemented all of the most crucial and most powerful formulas from the others; now let's see if the universe makes a fool of me or not.

If it does, SO BE IT! Let's build an AI singularity empire from there.

We're about to teach a VIT diffusion. The real question is, will it learn - or will it collapse and need dual-block layers from Flux?

Better testing methodology development

I'm reading up on some papers about how various companies and research institutions tested their vits. My testing methodology isn't accurate enough, because accuracy doesn't just reflect the logit alignments but also the internal layers' feature generations.

I'm crutching heavily on the logit alignment instead of managing the feature alignment testing as well, which is likely cutting heavily into my system.

Currently I'm building a notebook with better feature-testing capabilities to test the features correctly. I anticipate faster trains once the confidence actually starts to pick up, since currently they are not confident at all in terms of classification.
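
One standard way to test features rather than logits is a linear probe: freeze the backbone, extract features for the train/test splits, and fit a single linear classifier on top. A minimal sketch, assuming a `model.extract_features(images)` method exists (placeholder name, not the repo's API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract(model, loader, device='cpu'):
    """Collect frozen features and labels for a whole split."""
    model.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(model.extract_features(x.to(device)).cpu())  # placeholder API
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe(train_feats, train_labels, test_feats, test_labels,
                 classes=100, epochs=20, lr=1e-2):
    """Fit one linear layer on frozen features; its accuracy measures how
    linearly separable the learned representation actually is."""
    probe = torch.nn.Linear(train_feats.size(1), classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(probe(train_feats), train_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    acc = (probe(test_feats).argmax(1) == test_labels).float().mean()
    return acc.item()
```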

It's possible these vits could be MUCH MORE or MUCH LESS accurate than advertised, and I apologize for the inconvenience this has caused any onlookers. I'll be updating with additional inference code very soon.

Tinkerbell

128d, 128 heads, 4.0 mlp ratio, depth 4, only geometric attention...

Well it might work. I could make it smaller, but I doubt tinkerbell would extract anything useful. Good luck little one.

Enabling the Mix-N-Cut

I've built a mix-n-cut that I've been avoiding enabling. This one is particularly formatted for pentachora, so we'll see how it fares. I'm trying to build one as SMALL AS POSSIBLE, so if this mix-n-cut can pull the task out of the bag I may as well run it.
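
For context, here's a minimal sketch of the standard CutMix operation that a pentachoron-aware mix-n-cut would presumably build on; this is the generic version, not the penta-specific one.

```python
import torch

def cutmix(images, labels, num_classes, alpha=1.0):
    """Standard CutMix: paste a random crop from a shuffled batch and mix the
    labels in proportion to the pasted area."""
    B, _, H, W = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B)

    # random box whose area is roughly (1 - lam) of the image
    rh, rw = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, H)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, W)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]

    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (H * W)   # actual area kept
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam_adj * onehot + (1 - lam_adj) * onehot[perm]
    return mixed, mixed_labels
```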

As it stands, the tiny vits cap at 41% on cifar100 with no augmentations. I've been running all the trains without a single special effect and only minimal normalization.

Let's see how the upcoming trains fare.

pixie_base_128d_patch4_128h

Pixie base has 10 layers: 5 geometric and 5 traditional multi-head attention. Let's see how the mix-n-cut fares with the earlier ones first, then we'll run the base.
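
Structurally, the stacking is simple; a minimal sketch of one possible interleave (the exact ordering isn't specified here, and `GeometricBlock` is only a stand-in for the pentachoron-based attention block, which isn't reproduced):

```python
import torch.nn as nn

class StandardBlock(nn.Module):
    """Plain pre-norm transformer block (multi-head attention + MLP)."""
    def __init__(self, dim=128, heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class GeometricBlock(nn.Identity):
    """Placeholder for the custom geometric attention block (not shown here)."""

def pixie_base_stack(dim=128, depth=10):
    """Alternate 5 geometric and 5 traditional attention blocks."""
    return nn.Sequential(*[GeometricBlock() if i % 2 == 0 else StandardBlock(dim)
                           for i in range(depth)])
```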

The smaller ones seem to behave better using the geometric attention at 256 expert heads, which is odd to me but whatever works. They don't get much bigger with more experts, so I'll just try a tiny one with a ton of heads first.

Pentachoron Geometric Feature Extraction

Pentachora VITs are essentially micro-sized feature extractors that provide substantial accuracy for their small size. The more experiments I run, the smaller they become. The final goal is a full clip-vit that can house the entirety of laion 400m in a fraction of the size and compute of OpenAI's clip-vit line. After that point, I'll be confident the math is lined up well enough to train the true flagship - Beatrix.

The process of useful classification and feature extraction has been a non-trivial problem in the Computer Science industry for a long time.

This repo will house the various vit experiments that I frankenstein together, manifesting their weights and model code in the repo itself.

As I am an independent researcher my resources are limited and I don't have the backing of any donors, so there will be time gaps unless some hardware is sliced off for me.

Many of my repos have certain elements purposely omitted for papers in writing, my thesis arguments, my statements about certain universal elements, and a multitude of other ramblings - I don't plan to release specific key details in full phonebook fashion for just ANY PERSON to read.

Let me use your high-end hardware. I deliver - success or failure, but I will deliver.

I will not rattle a tin cup for you. Work out a deal with me and you get the weights - I get the classes developed for further use, meant for public release.

Let me know if you're willing to work with me. I'll gladly share the code, the process, the progress, and the built accumulated warchest of potentials that this system entails if you provide me gateways to some hardware that I can utilize.

Until then, one experiment at a time.
