---
license: mit
language:
- en
base_model:
- inclusionAI/Ling-flash-base-2.0
tags:
- storywriting
- preview
---

# This model is a preview, unfinished, and still in development. It is not representative of any final product and has only been remotely published to prove that I am doing something productive with my life

## Koto Large 106B-a6B (Preview)

Koto-Large-106B-Preview is a version of Ling-Flash-Base-2.0 trained on almost a billion tokens of creative writing data.

***Thanks to [lium.io](https://lium.io/) for the compute! <3***

## Usage

Um. Don't, please? But if you must, our testers found that:

> Temp 1.1, min_p 0.01, rep pen 1.02, freq pen -0.04 is the best. Somehow.

(A minimal sketch of passing these samplers to an OpenAI-compatible endpoint is at the top of the Technical Appendix below.)

## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- Thanks again to fish and co. from lium for the compute
- Thanks to curse for testing and ideas
- Thanks to toasty for some data and ideas
- Thanks to everyone else in allura for the moral support, ilya <3

## Technical Appendix
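### Inference Sketch

The sketch below wires the recommended samplers from the Usage section into an OpenAI-compatible completion endpoint, such as a local vLLM or llama.cpp server. The `base_url`, the placeholder model id, and the assumption that `min_p` / `repetition_penalty` are accepted via `extra_body` are backend-dependent guesses rather than anything this card guarantees; check your server's docs for the exact parameter names (llama.cpp, for instance, calls repetition penalty `repeat_penalty`).

```python
# Hedged sketch: send the recommended samplers to an OpenAI-compatible server.
# The endpoint URL, model id, and extra_body keys are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.completions.create(
    model="koto-large-106b-preview",   # placeholder model id
    prompt="The lighthouse keeper had not spoken aloud in three years.",
    max_tokens=400,
    temperature=1.1,                   # Temp 1.1
    frequency_penalty=-0.04,           # freq pen -0.04
    extra_body={
        "min_p": 0.01,                 # min_p 0.01 (backend-specific)
        "repetition_penalty": 1.02,    # rep pen 1.02 (backend-specific)
    },
)
print(response.choices[0].text)
```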
### Training Notes

Yeah, I don't even know what went wrong here, I'll be real. It's... oddly *really* stupid, though it reportedly outputs good prose sometimes. It's similar to GLM-4 32B's base model in that regard.

![image](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/c3_rf59rJPijY65_8fMMl.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/rB6VeE-Zwy8xOyGur2ki4.png)

Training took ~18 hours on 8xH200. intervitens had already converted the model to use a faster MoE layer for training, and I further patched it to use CCE (Cut Cross-Entropy) to bring memory down a little. TorchAO's 8-bit AdamW was used for optimization, and FSDP was used for model sharding (it's a lot more stable and better supported than DeepSpeed, in my experience).

### Plans for Next Time

I have three theories as to what went wrong:

- MoE training is busted (possible, but the 0.25-epoch checkpoint looked oddly promising, so I don't think that's it)
- The model did not get trained on enough data
- The model was not trained aggressively enough

I have a feeling that a higher LR and/or more epochs over the data would give a much better end result.

### Axolotl Config

```yaml
base_model: /root/ling-scm
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

# < -- Saving -- >
output_dir: ./koto-106B-a6B
saves_per_epoch: 4

# < -- Vram Savings -- >
#gradient_checkpointing: true
flash_attention: true

fsdp:
  - auto_wrap
  - full_shard
fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: BailingSharedMoeV2DecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true # will disable if it doesn't work

# < -- Evals -- >
#evals_per_epoch
#eval_steps: 100
val_set_size: 0.0

# < -- Hparams -- >
warmup_steps: 50

sequence_len: 24576
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

weight_decay: 0.0025
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
max_grad_norm: 1.0

optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1e-5

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: last_run_prepared
remove_unused_columns: false

# < -- wandb -- >
wandb_project: Koto 106B a6B
wandb_entity:
wandb_watch:
wandb_name: cunning-linguist-1
wandb_log_model:

# < -- Misc -- >
train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32: false

early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
```
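### Optimizer Sketch

For readers who want to see how the hyperparameters above map onto plain PyTorch, here is an illustrative sketch of the optimizer/scheduler setup (TorchAO 8-bit AdamW, cosine schedule with warmup, gradient clipping). Axolotl wires all of this up internally, along with FSDP sharding and sample packing; the tiny stand-in model, the step count, and the `torchao.optim.AdamW8bit` import path (older torchao releases keep it under `torchao.prototype.low_bit_optim`) are assumptions for illustration, not the actual training code.

```python
# Illustrative sketch only: roughly how the optimizer/scheduler keys of the
# Axolotl config above map onto raw PyTorch. The model and loss are placeholders.
import torch
from torchao.optim import AdamW8bit  # assumed import path; older torchao: torchao.prototype.low_bit_optim
from transformers import get_cosine_schedule_with_warmup

device = "cuda"  # torchao's 8-bit optimizer states are intended for GPU tensors
model = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device=device)  # stand-in for the 106B MoE

# optimizer: adamw_torch_8bit, learning_rate: 1e-5, weight_decay: 0.0025
optimizer = AdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.0025)

# lr_scheduler: cosine, warmup_steps: 50 (num_training_steps comes from the dataset in practice)
num_training_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=50, num_training_steps=num_training_steps
)

for step in range(num_training_steps):
    batch = torch.randn(1, 4096, dtype=torch.bfloat16, device=device)
    loss = model(batch).float().pow(2).mean()  # dummy loss standing in for the LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_grad_norm: 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

In the real run all of this is driven by the YAML above; the sketch is only meant to show which knobs those config keys correspond to.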