This model is a preview, unfinished, and still in development. It is not representative of any final product and has only been published remotely to prove that I am doing something productive with my life.

Koto Large 106B-a6B (Preview)

Koto-Large-106B-Preview is a version of Ling-Flash-Base-2.0 trained on almost a billion tokens of creative writing data.

Thanks to lium.io for the compute! <3

Usage

Um. Please don't?

But if you must, our testers found that the following sampler settings work best, somehow:

Temperature 1.1, min_p 0.01, repetition penalty 1.02, frequency penalty -0.04
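
For the curious, here's a minimal sketch of feeding those samplers to an OpenAI-compatible completion endpoint (e.g. a local vLLM server). Since this is a base model, the plain completions endpoint is used rather than chat. The base URL, prompt, and the extra_body keys for min_p and repetition penalty are assumptions about your particular backend, so adjust to taste.

# Hypothetical example: querying a locally served copy of the model with the
# recommended samplers. min_p and repetition_penalty are not part of the
# OpenAI spec; vLLM (and some other servers) accept them via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.completions.create(
    model="allura-org/Koto-Large-106B-Preview",
    prompt="The rain hadn't stopped for three days when",
    max_tokens=256,
    temperature=1.1,
    frequency_penalty=-0.04,
    extra_body={"min_p": 0.01, "repetition_penalty": 1.02},
)
print(resp.choices[0].text)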

Datasets

Some of the data used to train this model includes:

  • Most of The Anarchist Library, a repository for anarchist manifestos and writing (see allura-org/the-anarchist-library)
  • A random sample of public domain books from Project Gutenberg
  • Furry (anthro and feral) storytelling and smut
  • A small subset of known high-quality books and story data
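
As a quick sketch, the Anarchist Library portion above can be pulled with the Hugging Face datasets library; the split name and column layout here are assumptions, so check the dataset card before relying on them.

# Hypothetical sketch: loading one of the data sources listed above.
from datasets import load_dataset

anarchist_library = load_dataset("allura-org/the-anarchist-library", split="train")
print(anarchist_library[0])  # field names depend on the dataset card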

Acknowledgements

  • thanks again to fish and co from lium for compute
  • thanks to curse for testing, ideas
  • thanks to toasty for some data, ideas
  • thanks to everyone else in allura for moral support

ilya <3

Technical Appendix

Training Notes

Yeah, I don't really know what went wrong here, I'll be honest. The model turned out oddly stupid, though it reportedly outputs good prose at times; it's similar to GLM-4 32B's base in that respect.

Training took ~18 hours on 8xH200. intervitens had already converted the model to use a faster MoE layer implementation for training, and I further patched it to use CCE (Cut Cross-Entropy) to bring memory down a bit.
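
For context, the point of CCE is to avoid materializing the full tokens x vocab logits tensor when computing the LM loss. The snippet below is a simplified, pure-PyTorch chunked-and-checkpointed stand-in that illustrates the idea; it is not the fused kernel the actual cut-cross-entropy package provides, nor the exact patch used for this run.

# Illustration only: a chunked LM loss that keeps peak logit memory at roughly
# chunk_size x vocab instead of tokens x vocab. Checkpointing each chunk means
# its logits are recomputed during backward rather than kept around.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_ce_sum(hidden_chunk, labels_chunk, lm_head_weight):
    logits = hidden_chunk @ lm_head_weight.T  # [chunk, vocab]
    return F.cross_entropy(logits, labels_chunk, ignore_index=-100, reduction="sum")

def chunked_lm_loss(hidden, labels, lm_head_weight, chunk_size=4096):
    hidden = hidden.reshape(-1, hidden.shape[-1])  # [tokens, dim]
    labels = labels.reshape(-1)                    # [tokens]
    n_valid = (labels != -100).sum().clamp(min=1)
    total = hidden.new_zeros(())
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        total = total + checkpoint(_chunk_ce_sum, h, y, lm_head_weight,
                                   use_reentrant=False)
    return total / n_valid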

TorchAO's 8-bit AdamW was used for optimization. FSDP was used for model sharding (it's a lot more stable and better supported than DeepSpeed, in my experience).
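
Below is a rough sketch of that setup in plain PyTorch, assuming torch.distributed is already initialized and a recent torchao is installed (in older releases the 8-bit optimizers live under torchao.prototype.low_bit_optim). Note that the axolotl config below requests FSDP2 (fsdp_version: 2); this sketch uses the older FSDP1 wrapper API for brevity, and decoder_layer_cls is a placeholder for the actual BailingSharedMoeV2DecoderLayer class.

# Hypothetical sketch: shard a model with FSDP and attach torchao's 8-bit AdamW.
from functools import partial

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torchao.optim import AdamW8bit  # older torchao: torchao.prototype.low_bit_optim

def shard_and_build_optimizer(model, decoder_layer_cls):
    # Wrap each decoder layer in its own FSDP unit, mirroring
    # fsdp_transformer_layer_cls_to_wrap in the config below.
    wrap_policy = partial(transformer_auto_wrap_policy,
                          transformer_layer_cls={decoder_layer_cls})
    model = FSDP(model, auto_wrap_policy=wrap_policy, use_orig_params=True)
    # 8-bit optimizer states substantially cut optimizer memory vs. fp32 AdamW.
    optimizer = AdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.0025)
    return model, optimizer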

Plans for Next Time

I have three theories as to what went wrong:

  • MoE training is busted (possible, but the 0.25 epoch checkpoint looked oddly promising, so I don't think that's it)
  • The model did not get trained on enough data
  • The model was not trained aggressively enough

I have a feeling that a higher learning rate and/or more epochs over the data would give a much better end result.

Axolotl Config

base_model: /root/ling-scm
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

# < -- Saving -- >
output_dir: ./koto-106B-a6B
saves_per_epoch: 4


# < -- Vram Savings -- >
#gradient_checkpointing: true
flash_attention: true

fsdp:
  - auto_wrap
  - full_shard
fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: BailingSharedMoeV2DecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true # will disable if it doesn't work

# < -- Evals -- >
#evals_per_epoch
#eval_steps: 100
val_set_size: 0.0

# < -- Hparams -- >
warmup_steps: 50
sequence_len: 24576
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

weight_decay: 0.0025
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
max_grad_norm: 1.0
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1e-5

## data 
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text


shuffle_merged_datasets: true
dataset_prepared_path: last_run_prepared
remove_unused_columns: false

# < -- wandb -- >
wandb_project: Koto 106B a6B
wandb_entity:
wandb_watch:
wandb_name: cunning-linguist-1
wandb_log_model:

# < -- Misc -- >
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: