This model is a preview, unfinished, and still in development. It is not representative of any final product and has only been published to prove that I am doing something remotely productive with my life.
Koto Large 106B-a6B (Preview)
Koto-Large-106B-Preview is a version of Ling-Flash-Base-2.0 trained on almost a billion tokens of creative writing data.
Thanks to lium.io for the compute! <3
Usage
Um. Don't please?
But if you must, our testers found that:
Temp 1.1, min_p 0.01, rep pen 1.02, freq pen -0.04
works best. Somehow.
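If you do load it anyway, here is a minimal sketch of passing those samplers to a local OpenAI-compatible server (e.g. vLLM). The endpoint URL and model name are placeholders, and min_p / repetition_penalty are sent via extra_body because they are not part of the standard OpenAI schema.

from openai import OpenAI

# Placeholder endpoint and model name; point these at wherever you serve the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.completions.create(
    model="koto-large-106b-preview",  # whatever name your server registers
    prompt="The rain had not stopped for three days when",
    max_tokens=512,
    temperature=1.1,
    frequency_penalty=-0.04,
    extra_body={
        "min_p": 0.01,               # vLLM-style extension parameter
        "repetition_penalty": 1.02,  # ditto
    },
)
print(response.choices[0].text)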
Datasets
Some of the data used to train this model includes:
- Most of The Anarchist Library, a repository for anarchist manifestos and writing (see allura-org/the-anarchist-library)
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data
Acknowledgements
- thanks again to fish and co from lium for compute
- thanks to curse for testing, ideas
- thanks to toasty for some data, ideas
- thanks to everyone else in allura for moral support
ilya <3
Technical Appendix
Training Notes
Yeah, idek what went wrong here, I'll be real. It's... oddly, really stupid, though it reportedly outputs good prose sometimes; it's similar to GLM-4 32B's base in that respect.
Training took ~18hrs on 8xH200. intervitens had already converted the model to use a faster MoE layer for training, and I further patched it to use CCE to bring memory usage down a bit.
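For context, CCE here presumably means Cut Cross-Entropy, which fuses the lm_head projection into the loss kernel so the full [batch, seq, vocab] logit tensor is never materialized. The snippet below is a rough sketch of that idea using the cut_cross_entropy package's linear_cross_entropy; the tensor names are illustrative and this is not the actual Bailing MoE patch.

# Sketch of computing the LM loss with Cut Cross-Entropy instead of
# projecting to full logits and calling F.cross_entropy.
import torch
from cut_cross_entropy import linear_cross_entropy

def cce_language_modeling_loss(
    hidden_states: torch.Tensor,   # [batch, seq, hidden], bf16 (the config below trains in bf16)
    lm_head_weight: torch.Tensor,  # [vocab, hidden]
    labels: torch.Tensor,          # [batch, seq], assumed already shifted for next-token prediction
) -> torch.Tensor:
    # The fused kernel never builds the [batch * seq, vocab] logit matrix,
    # which is where the memory saving comes from.
    return linear_cross_entropy(
        hidden_states.flatten(0, 1),
        lm_head_weight,
        labels.flatten(),
    )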
TorchAO's 8-bit AdamW was used for optimization, and FSDP was used for model sharding (it's a lot more stable and better supported than DeepSpeed, in my experience).
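Roughly, the optimizer: adamw_torch_8bit line in the config below amounts to the sketch here: torchao's 8-bit AdamW keeps the optimizer moments in 8-bit, cutting optimizer-state memory to roughly a quarter of fp32 AdamW while the training step itself is unchanged. The import path is an assumption in the sense that it has moved between torchao releases (torchao.optim in recent ones, torchao.prototype.low_bit_optim in older ones).

import torch
from torchao.optim import AdamW8bit  # older torchao: torchao.prototype.low_bit_optim

# Stand-in module; the real run shards the full model with FSDP2 as configured below.
model = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16)

optimizer = AdamW8bit(
    model.parameters(),
    lr=1e-5,              # matches learning_rate in the config below
    weight_decay=0.0025,  # matches weight_decay in the config below
)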
Plans for Next Time
I have three theories as to what went wrong:
- MoE training is busted (possible, but the 0.25 epoch checkpoint looked oddly promising, so I don't think that's it)
- The model did not get trained on enough data
- The model was not trained aggressively enough
I have a feeling that a higher LR and/or more epochs over the data would give a much better end result.
Axolotl Config
base_model: /root/ling-scm
trust_remote_code: true
load_in_8bit: false
load_in_4bit: false
strict: false
# < -- Saving -- >
output_dir: ./koto-106B-a6B
saves_per_epoch: 4
# < -- Vram Savings -- >
#gradient_checkpointing: true
flash_attention: true
fsdp:
  - auto_wrap
  - full_shard
fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: BailingSharedMoeV2DecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true # will disable if it doesn't work
# < -- Evals -- >
#evals_per_epoch
#eval_steps: 100
val_set_size: 0.0
# < -- Hparams -- >
warmup_steps: 50
sequence_len: 24576
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
weight_decay: 0.0025
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
max_grad_norm: 1.0
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1e-5
## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text
shuffle_merged_datasets: true
dataset_prepared_path: last_run_prepared
remove_unused_columns: false
# < -- wandb -- >
wandb_project: Koto 106B a6B
wandb_entity:
wandb_watch:
wandb_name: cunning-linguist-1
wandb_log_model:
# < -- Misc -- >
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: