---
license: mit
language:
- en
base_model:
- inclusionAI/Ling-flash-base-2.0
tags:
- storywriting
- preview
---

# This model is a preview, unfinished, and still in development. It is not representative of any final product and has only been remotely published to prove that I am doing something productive with my life

## Koto Large 106B-a6B (Preview)

Koto-Large-106B-Preview is a version of Ling-Flash-Base-2.0 trained on almost a billion tokens of creative writing data.

***Thanks to [lium.io](https://lium.io/) for the compute! <3***

## Usage

Um. Don't, please? But if you must, our testers found that:

> Temp 1.1, min_p 0.01, rep pen 1.02, freq pen -0.04 is the best. Somehow.

(A minimal sketch of passing these samplers to an OpenAI-compatible endpoint is at the top of the Technical Appendix below.)

## Datasets

Some of the data used to train this model includes:

- Most of [The Anarchist Library](https://theanarchistlibrary.org/), a repository for anarchist manifestos and writing (see [allura-org/the-anarchist-library](https://huggingface.co/datasets/allura-org/the-anarchist-library))
- A random sample of public domain books from Project Gutenberg
- Furry (anthro and feral) storytelling and smut
- A small subset of known high-quality books and story data

## Acknowledgements

- Thanks again to fish and co. from lium for the compute
- Thanks to curse for testing and ideas
- Thanks to toasty for some data and ideas
- Thanks to everyone else in allura for the moral support, ilya <3

## Technical Appendix
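### Inference Sketch

The sketch below wires the recommended samplers from the Usage section into an OpenAI-compatible completion endpoint, such as a local vLLM or llama.cpp server. The `base_url`, the placeholder model id, and the assumption that `min_p` / `repetition_penalty` are accepted via `extra_body` are backend-dependent guesses rather than anything this card guarantees; check your server's docs for the exact parameter names (llama.cpp, for instance, calls repetition penalty `repeat_penalty`).

```python
# Hedged sketch: send the recommended samplers to an OpenAI-compatible server.
# The endpoint URL, model id, and extra_body keys are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.completions.create(
    model="koto-large-106b-preview",   # placeholder model id
    prompt="The lighthouse keeper had not spoken aloud in three years.",
    max_tokens=400,
    temperature=1.1,                   # Temp 1.1
    frequency_penalty=-0.04,           # freq pen -0.04
    extra_body={
        "min_p": 0.01,                 # min_p 0.01 (backend-specific)
        "repetition_penalty": 1.02,    # rep pen 1.02 (backend-specific)
    },
)
print(response.choices[0].text)
```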
### Training Notes

Yeah, I don't even know what went wrong here, I'll be real. It's... oddly *really* stupid, though it reportedly outputs good prose sometimes. It's similar to GLM-4 32B's base model in that regard.

![image](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/c3_rf59rJPijY65_8fMMl.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/634262af8d8089ebaefd410e/rB6VeE-Zwy8xOyGur2ki4.png)

Training took ~18 hours on 8xH200. intervitens had already converted the model to use a faster MoE layer for training, and I further patched it to use CCE (Cut Cross-Entropy) to bring memory down a little. TorchAO's 8-bit AdamW was used for optimization, and FSDP was used for model sharding (it's a lot more stable and better supported than DeepSpeed, in my experience).

### Plans for Next Time

I have three theories as to what went wrong:

- MoE training is busted (possible, but the 0.25-epoch checkpoint looked oddly promising, so I don't think that's it)
- The model did not get trained on enough data
- The model was not trained aggressively enough

I have a feeling that a higher LR and/or more epochs over the data would give a much better end result.

### Axolotl Config

```yaml
base_model: /root/ling-scm
trust_remote_code: true

load_in_8bit: false
load_in_4bit: false
strict: false

# < -- Saving -- >
output_dir: ./koto-106B-a6B
saves_per_epoch: 4

# < -- Vram Savings -- >
#gradient_checkpointing: true
flash_attention: true

fsdp:
  - auto_wrap
  - full_shard
fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: BailingSharedMoeV2DecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true # will disable if it doesn't work

# < -- Evals -- >
#evals_per_epoch
#eval_steps: 100
val_set_size: 0.0

# < -- Hparams -- >
warmup_steps: 50

sequence_len: 24576
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

weight_decay: 0.0025
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
max_grad_norm: 1.0

optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1e-5

## data
datasets:
  - path: estrogen/bookscpt2
    type: completion
    field: text

shuffle_merged_datasets: true
dataset_prepared_path: last_run_prepared
remove_unused_columns: false

# < -- wandb -- >
wandb_project: Koto 106B a6B
wandb_entity:
wandb_watch:
wandb_name: cunning-linguist-1
wandb_log_model:

# < -- Misc -- >
train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32: false

early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
```
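### Optimizer Sketch

For readers who want to see how the hyperparameters above map onto plain PyTorch, here is an illustrative sketch of the optimizer/scheduler setup (TorchAO 8-bit AdamW, cosine schedule with warmup, gradient clipping). Axolotl wires all of this up internally, along with FSDP sharding and sample packing; the tiny stand-in model, the step count, and the `torchao.optim.AdamW8bit` import path (older torchao releases keep it under `torchao.prototype.low_bit_optim`) are assumptions for illustration, not the actual training code.

```python
# Illustrative sketch only: roughly how the optimizer/scheduler keys of the
# Axolotl config above map onto raw PyTorch. The model and loss are placeholders.
import torch
from torchao.optim import AdamW8bit  # assumed import path; older torchao: torchao.prototype.low_bit_optim
from transformers import get_cosine_schedule_with_warmup

device = "cuda"  # torchao's 8-bit optimizer states are intended for GPU tensors
model = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device=device)  # stand-in for the 106B MoE

# optimizer: adamw_torch_8bit, learning_rate: 1e-5, weight_decay: 0.0025
optimizer = AdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.0025)

# lr_scheduler: cosine, warmup_steps: 50 (num_training_steps comes from the dataset in practice)
num_training_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=50, num_training_steps=num_training_steps
)

for step in range(num_training_steps):
    batch = torch.randn(1, 4096, dtype=torch.bfloat16, device=device)
    loss = model(batch).float().pow(2).mean()  # dummy loss standing in for the LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_grad_norm: 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

In the real run all of this is driven by the YAML above; the sketch is only meant to show which knobs those config keys correspond to.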