This model is a preview, unfinished, and still in development. It is not representative of any final product and has only been published remotely to prove that I am doing something productive with my life.

Koto Large 106B-a6B (Preview)

Koto-Large-106B-Preview is a version of Ling-Flash-Base-2.0 trained on almost a billion tokens of creative writing data.

Thanks to lium.io for the compute! <3

Usage

Um. Please don't?

But if you must, our testers found that the following sampler settings work best, somehow:

Temperature 1.1, min_p 0.01, repetition penalty 1.02, frequency penalty -0.04
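
For the curious, here's a minimal sketch of feeding those samplers to an OpenAI-compatible completion endpoint (e.g. a local vLLM server). Since this is a base model, the plain completions endpoint is used rather than chat. The base URL, prompt, and the extra_body keys for min_p and repetition penalty are assumptions about your particular backend, so adjust to taste.

# Hypothetical example: querying a locally served copy of the model with the
# recommended samplers. min_p and repetition_penalty are not part of the
# OpenAI spec; vLLM (and some other servers) accept them via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.completions.create(
    model="allura-org/Koto-Large-106B-Preview",
    prompt="The rain hadn't stopped for three days when",
    max_tokens=256,
    temperature=1.1,
    frequency_penalty=-0.04,
    extra_body={"min_p": 0.01, "repetition_penalty": 1.02},
)
print(resp.choices[0].text)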

Datasets

Some of the data used to train this model includes:

  • Most of The Anarchist Library, a repository for anarchist manifestos and writing (see allura-org/the-anarchist-library)
  • A random sample of public domain books from Project Gutenberg
  • Furry (anthro and feral) storytelling and smut
  • A small subset of known high-quality books and story data
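
As a quick sketch, the Anarchist Library portion above can be pulled with the Hugging Face datasets library; the split name and column layout here are assumptions, so check the dataset card before relying on them.

# Hypothetical sketch: loading one of the data sources listed above.
from datasets import load_dataset

anarchist_library = load_dataset("allura-org/the-anarchist-library", split="train")
print(anarchist_library[0])  # field names depend on the dataset card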

Acknowledgements

  • thanks again to fish and co from lium for compute
  • thanks to curse for testing, ideas
  • thanks to toasty for some data, ideas
  • thanks to everyone else in allura for moral support

ilya <3

Technical Appendix

Training Notes

Yeah, I don't really know what went wrong here, I'll be honest. The model turned out oddly stupid, though it reportedly outputs good prose at times; it's similar to GLM-4 32B's base in that respect.

Training took ~18 hours on 8xH200. intervitens had already converted the model to use a faster MoE layer implementation for training, and I further patched it to use CCE (Cut Cross-Entropy) to bring memory down a bit.
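
For context, the point of CCE is to avoid materializing the full tokens x vocab logits tensor when computing the LM loss. The snippet below is a simplified, pure-PyTorch chunked-and-checkpointed stand-in that illustrates the idea; it is not the fused kernel the actual cut-cross-entropy package provides, nor the exact patch used for this run.

# Illustration only: a chunked LM loss that keeps peak logit memory at roughly
# chunk_size x vocab instead of tokens x vocab. Checkpointing each chunk means
# its logits are recomputed during backward rather than kept around.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_ce_sum(hidden_chunk, labels_chunk, lm_head_weight):
    logits = hidden_chunk @ lm_head_weight.T  # [chunk, vocab]
    return F.cross_entropy(logits, labels_chunk, ignore_index=-100, reduction="sum")

def chunked_lm_loss(hidden, labels, lm_head_weight, chunk_size=4096):
    hidden = hidden.reshape(-1, hidden.shape[-1])  # [tokens, dim]
    labels = labels.reshape(-1)                    # [tokens]
    n_valid = (labels != -100).sum().clamp(min=1)
    total = hidden.new_zeros(())
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        total = total + checkpoint(_chunk_ce_sum, h, y, lm_head_weight,
                                   use_reentrant=False)
    return total / n_valid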

TorchAO's 8-bit AdamW was used for optimization. FSDP was used for model sharding (it's a lot more stable and better supported than DeepSpeed, in my experience).
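
Below is a rough sketch of that setup in plain PyTorch, assuming torch.distributed is already initialized and a recent torchao is installed (in older releases the 8-bit optimizers live under torchao.prototype.low_bit_optim). Note that the axolotl config below requests FSDP2 (fsdp_version: 2); this sketch uses the older FSDP1 wrapper API for brevity, and decoder_layer_cls is a placeholder for the actual BailingSharedMoeV2DecoderLayer class.

# Hypothetical sketch: shard a model with FSDP and attach torchao's 8-bit AdamW.
from functools import partial

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torchao.optim import AdamW8bit  # older torchao: torchao.prototype.low_bit_optim

def shard_and_build_optimizer(model, decoder_layer_cls):
    # Wrap each decoder layer in its own FSDP unit, mirroring
    # fsdp_transformer_layer_cls_to_wrap in the config below.
    wrap_policy = partial(transformer_auto_wrap_policy,
                          transformer_layer_cls={decoder_layer_cls})
    model = FSDP(model, auto_wrap_policy=wrap_policy, use_orig_params=True)
    # 8-bit optimizer states substantially cut optimizer memory vs. fp32 AdamW.
    optimizer = AdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.0025)
    return model, optimizer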

Plans for Next Time

I have three theories as to what went wrong:

  • MoE training is busted (possible, but the 0.25 epoch checkpoint looked oddly promising, so I don't think that's it)
  • The model did not get trained on enough data
  • The model was not trained aggressively enough

I have a feeling that a higher learning rate and/or more epochs over the data would give a much better end result.

Axolotl Config

base_model: /root/ling-scm
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

# < -- Saving -- >
output_dir: ./koto-106B-a6B
saves_per_epoch: 4


# < -- Vram Savings -- >
#gradient_checkpointing: true
flash_attention: true

fsdp:
  - auto_wrap
  - full_shard
fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: BailingSharedMoeV2DecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true # will disable if it doesn't work

# < -- Evals -- >
#evals_per_epoch
#eval_steps: 100
val_set_size: 0.0

# < -- Hparams -- >
warmup_steps: 50
sequence_len: 24576
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

weight_decay: 0.0025
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
max_grad_norm: 1.0
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1e-5

## data 
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text


shuffle_merged_datasets: true
dataset_prepared_path: last_run_prepared
remove_unused_columns: false

# < -- wandb -- >
wandb_project: Koto 106B a6B
wandb_entity:
wandb_watch:
wandb_name: cunning-linguist-1
wandb_log_model:

# < -- Misc -- >
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: