Commit 370f342 by esunAI · 0 Parent(s)

Initial FlowAMP upload: Complete project with all essential files
MODEL_FILES_INFO.md ADDED
@@ -0,0 +1,31 @@
# Model Files Information

## Available Files
- normalization_stats.pt: Preprocessing statistics for ESM-2 embeddings

## Missing Files (Too Large for Hugging Face)
The following model files exceed the 100 MB upload limit and are therefore not included:

### Large Model Files (Not Included)
- flowamp_demo_checkpoint.pth (~1.5 GB): Complete model checkpoint
- compressor_demo.pth (~315 MB): Compressor weights
- decompressor_demo.pth (~158 MB): Decompressor weights
- flow_model_demo.pth (~54 MB): Flow model weights
- apex/trained_models/* (~1 GB total): Pre-trained Apex models

### How to Get Model Files
1. **Train your own**: Use the provided training scripts to train the model
2. **Contact the author**: Request the model files directly from the author
3. **Alternative storage**: Model files may be available on other platforms

### Training Instructions
1. Run the training scripts to generate your own model checkpoints
2. Use amp_flow_training_single_gpu_full_data.py for single-GPU training
3. Use amp_flow_training_multi_gpu.py for multi-GPU training
4. Models are saved automatically during training (see the checkpoint-loading sketch at the end of this file)

### Quick Start
1. Install dependencies: pip install -r requirements.txt
2. Run usage_example.py to verify the installation
3. Train the model using the provided scripts
4. Use generate_amps.py for AMP generation
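
### Loading a Checkpoint (Sketch)

Once you have trained or obtained a checkpoint, it can be inspected as shown below. This is a minimal sketch that assumes the file follows the dictionary format written by the provided training scripts; the exact keys inside `flowamp_demo_checkpoint.pth` may differ.

```python
import torch

# Load on CPU first; move weights to GPU after the model is constructed.
ckpt = torch.load("flowamp_demo_checkpoint.pth", map_location="cpu")

flow_state = ckpt["flow_model_state_dict"]   # flow model weights (assumed key)
stats = ckpt.get("stats")                    # normalization statistics, if stored
print("step:", ckpt.get("step"), "loss:", ckpt.get("loss"))
```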
README.md ADDED
@@ -0,0 +1,159 @@
# FlowAMP: Flow-based Antimicrobial Peptide Generation

## Overview

FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs) using conditional flow matching and ESM-2 protein language model embeddings. The project implements a state-of-the-art approach to de novo AMP design with improved generation quality and diversity.

## Key Features

- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Implements classifier-free guidance (CFG) for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed-precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment

## Project Structure

```
flow/
├── final_flow_model.py                        # Main FlowAMP model architecture
├── final_sequence_encoder.py                  # ESM-2 sequence encoding
├── final_sequence_decoder.py                  # Sequence decoding and generation
├── compressor_with_embeddings.py              # Embedding compression/decompression
├── cfg_dataset.py                             # CFG dataset and dataloader
├── amp_flow_training_single_gpu_full_data.py  # Single-GPU training
├── amp_flow_training_multi_gpu.py             # Multi-GPU training
├── generate_amps.py                           # AMP generation script
├── test_generated_peptides.py                 # Evaluation and testing
├── apex/                                      # Apex model integration
│   ├── trained_models/                        # Pre-trained Apex models
│   └── AMP_DL_model_twohead.py                # Apex model architecture
├── normalization_stats.pt                     # Preprocessing statistics
└── requirements.yaml                          # Dependencies
```

## Model Architecture

The FlowAMP model consists of:

1. **ESM-2 Encoder**: Extracts protein sequence embeddings using ESM-2
2. **Compressor/Decompressor**: Reduces embedding dimensionality for efficiency
3. **Flow Matcher**: Conditional flow matching in the compressed latent space
4. **CFG Integration**: Classifier-free guidance for controllable generation (sketched below)

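A minimal sketch of how these components fit together at generation time is shown below. It mirrors how the compressor, decompressor, and flow model are called in the training scripts, but the function itself, its defaults (25 latent positions, compressed dimension 80, 25 Euler steps), and the time convention (t = 0 is data, t = 1 is noise, so each step subtracts the predicted velocity) are illustrative assumptions; `generate_amps.py` defines the project's actual sampler.

```python
import torch

@torch.no_grad()
def sample_compressed_latents(flow_model, decompressor, num_samples,
                              seq_len=25, comp_dim=80, steps=25,
                              label=0, device="cuda"):
    """Euler-integrate the learned vector field from noise (t=1) back to data (t=0)."""
    xt = torch.randn(num_samples, seq_len, comp_dim, device=device)   # start from pure noise
    labels = torch.full((num_samples,), label, device=device)         # 0 = AMP class
    dt = 1.0 / steps
    for step in range(steps):
        t = torch.full((num_samples,), 1.0 - step / steps, device=device)
        vt = flow_model(xt, t, labels=labels)   # predicted velocity d x_t / d t
        xt = xt - vt * dt                       # move toward the data end (t = 0)
    return decompressor(xt)                     # back to ESM-2 embedding space

# Overall data flow (illustrative):
#   training:   peptide sequence --ESM-2 encoder--> embedding --Compressor--> latent
#   generation: noise --flow model (above)--> latent --Decompressor--> embedding
#               --sequence decoder--> peptide sequence
```
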
## Training

### Single GPU Training
```bash
python amp_flow_training_single_gpu_full_data.py
```

### Multi-GPU Training
```bash
bash launch_multi_gpu_training.sh
```

### Key Training Parameters
- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with linear warmup and cosine annealing (see the sketch below)
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: 15% of batches trained unconditionally

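The warmup and cosine schedule above is built from PyTorch's standard schedulers, as in the training scripts. The sketch below shows the same composition on a stand-in model; the total step count is a placeholder, not the project's exact value.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(80, 80)                 # stand-in for the flow model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4,
                              betas=(0.9, 0.98), weight_decay=0.01, eps=1e-6)

warmup_steps, total_steps = 5000, 100_000       # total_steps is assumed for illustration
warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=2e-4)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for step in range(total_steps):
    optimizer.step()     # (forward/backward omitted in this sketch)
    scheduler.step()     # one scheduler step per optimizer step, not per epoch
```
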
## Generation

Generate AMPs with different CFG strengths:

```bash
python generate_amps.py --cfg_strength 0.0  # No CFG
python generate_amps.py --cfg_strength 1.0  # Weak CFG
python generate_amps.py --cfg_strength 2.0  # Strong CFG
python generate_amps.py --cfg_strength 3.0  # Very strong CFG
```

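At sampling time, classifier-free guidance blends a conditional and an unconditional velocity prediction at every integration step (for example inside a sampler like the one sketched under Model Architecture). The snippet below shows one common formulation; the unconditional label value (-1, matching the CFG dropout used in the single-GPU training script) and the exact scaling convention of `--cfg_strength` are assumptions — see `generate_amps.py` for the project's definition.

```python
import torch

def guided_velocity(flow_model, xt, t, labels, cfg_strength):
    """Blend conditional and unconditional predictions (one common CFG form)."""
    v_cond = flow_model(xt, t, labels=labels)           # conditional velocity
    if cfg_strength == 0.0:
        return v_cond                                   # plain conditional sampling
    null_labels = torch.full_like(labels, -1)           # "no label" value used at train time
    v_uncond = flow_model(xt, t, labels=null_labels)    # unconditional velocity
    # Push the prediction further in the direction implied by the condition
    return v_uncond + (1.0 + cfg_strength) * (v_cond - v_uncond)
```
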
## Evaluation

### MIC Prediction
The model integrates with Apex for MIC (Minimum Inhibitory Concentration) prediction:

```bash
python test_generated_peptides.py
```

### Performance Metrics
- **Generation Quality**: Evaluated using sequence diversity and validity
- **Antimicrobial Activity**: Predicted using the Apex model integration
- **CFG Effectiveness**: Measured through controlled generation

## Results

### Training Performance
- **Optimized for H100**: 31 steps/second with batch size 96
- **Mixed Precision**: BF16 training for memory efficiency
- **Gradient Clipping**: Stable training with norm = 1.0

### Generation Results
- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across different CFG strengths
- **Antimicrobial Potential**: Predicted MIC values for generated sequences

## Dependencies

Key dependencies include:
- PyTorch 2.0+
- Transformers (for ESM-2)
- Wandb (optional logging)
- Apex (for MIC prediction)

See `requirements.yaml` for the complete dependency list.

## Usage Examples

### Basic AMP Generation
```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps

# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')

# Generate AMPs
sequences = generate_amps(model, num_samples=100, cfg_strength=1.0)
```

### Evaluation
```python
from test_generated_peptides import evaluate_generated_peptides

# Evaluate generated sequences
results = evaluate_generated_peptides(sequences)
```

## Research Impact

This work contributes to:
- **Flow-based Protein Design**: Novel application of flow matching to peptide generation
- **Conditional Generation**: CFG integration for controllable AMP design
- **ESM-2 Integration**: Leveraging protein language models for sequence understanding
- **Antimicrobial Discovery**: Automated design of potential therapeutic peptides

## Citation

If you use this code in your research, please cite:

```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```

## License

MIT License - see LICENSE file for details.

## Contact

For questions or collaboration, please contact the authors.
UPLOAD_INSTRUCTIONS.txt ADDED
@@ -0,0 +1,48 @@
=== Upload Instructions ===

1. Navigate to the upload directory:
   cd flowamp_upload_small

2. Initialize a git repository and commit the files:
   git init
   git add .
   git commit -m "Initial FlowAMP upload (small version)"

3. Add the Hugging Face remote:
   git remote add origin https://huggingface.co/esunAI/FlowAMP

4. Push to Hugging Face:
   git push -u origin main

+ === Files Included ===
19
+
20
+ Core Model:
21
+ - final_flow_model.py: Main FlowAMP model architecture
22
+ - final_sequence_encoder.py: ESM-2 sequence encoding
23
+ - final_sequence_decoder.py: Sequence decoding and generation
24
+ - compressor_with_embeddings.py: Embedding compression/decompression
25
+ - cfg_dataset.py: CFG dataset and dataloader
26
+
27
+ Training:
28
+ - amp_flow_training_single_gpu_full_data.py: Single GPU training
29
+ - amp_flow_training_multi_gpu.py: Multi-GPU training
30
+ - launch_*.sh: Training launch scripts
31
+
32
+ Models:
33
+ - normalization_stats.pt: Preprocessing statistics
34
+ - MODEL_FILES_INFO.md: Information about missing large model files
35
+
36
+ Apex Integration:
37
+ - apex/AMP_DL_model_twohead.py: Apex model architecture
38
+ - apex/predict.py: MIC prediction script
39
+
40
+ Documentation:
41
+ - README.md: Comprehensive project documentation
42
+ - model_card.md: Hugging Face model card
43
+ - usage_example.py: Usage demonstration
44
+ - requirements.txt: Python dependencies
45
+
46
+ === Note ===
47
+ This is a smaller version without large model files due to Hugging Face size limits.
48
+ See MODEL_FILES_INFO.md for details on obtaining model weights.
amp_flow_training_multi_gpu.py ADDED
@@ -0,0 +1,439 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+ from torch.utils.data import DataLoader
5
+ from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
6
+ from torch.nn.parallel import DistributedDataParallel as DDP
7
+ from torch.utils.data.distributed import DistributedSampler
8
+ import torch.distributed as dist
9
+ import numpy as np
10
+ from tqdm import tqdm
11
+ import json
12
+ import os
13
+ import argparse
14
+
15
+ # Import your existing components
16
+ from compressor_with_embeddings import Compressor, Decompressor, PrecomputedEmbeddingDataset
17
+ from final_flow_model import AMPFlowMatcherCFGConcat, SinusoidalTimeEmbedding
18
+ from cfg_dataset import CFGFlowDataset, create_cfg_dataloader
19
+
20
+ # ---------------- Configuration ----------------
21
+ ESM_DIM = 1280 # ESM-2 hidden dim (esm2_t33_650M_UR50D)
22
+ COMP_RATIO = 16 # compression factor
23
+ COMP_DIM = ESM_DIM // COMP_RATIO
24
+ MAX_SEQ_LEN = 50 # Actual sequence length from final_sequence_encoder.py
25
+ BATCH_SIZE = 64 # Per GPU batch size (256 total across 4 GPUs) - increased for faster training
26
+ EPOCHS = 5000 # Extended to 5K iterations for more comprehensive training (~50 minutes)
27
+ BASE_LR = 1e-4 # initial learning rate
28
+ LR_MIN = 2e-5 # minimum learning rate for cosine schedule
29
+ WARMUP_STEPS = 100 # Reduced warmup for shorter training
30
+
31
+ def setup_distributed():
32
+ """Setup distributed training."""
33
+ if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
34
+ rank = int(os.environ["RANK"])
35
+ world_size = int(os.environ['WORLD_SIZE'])
36
+ local_rank = int(os.environ['LOCAL_RANK'])
37
+ else:
38
+ print('Not using distributed mode')
39
+ return None, None, None
40
+
41
+ torch.cuda.set_device(local_rank)
42
+ dist.init_process_group(backend='nccl', init_method='env://')
43
+ dist.barrier()
44
+
45
+ return rank, world_size, local_rank
46
+
47
+ class AMPFlowTrainerMultiGPU:
48
+ """
49
+ Multi-GPU training pipeline for AMP generation using ProtFlow methodology.
50
+ """
51
+
52
+ def __init__(self, embeddings_path, cfg_data_path, rank, world_size, local_rank):
53
+ self.rank = rank
54
+ self.world_size = world_size
55
+ self.local_rank = local_rank
56
+ self.device = torch.device(f'cuda:{local_rank}')
57
+ self.embeddings_path = embeddings_path
58
+ self.cfg_data_path = cfg_data_path
59
+
60
+ # Load ALL pre-computed embeddings (only on main process)
61
+ if self.rank == 0:
62
+ print(f"Loading ALL AMP embeddings from {embeddings_path}...")
63
+
64
+ # Try to load the combined embeddings file first (FULL DATA)
65
+ combined_path = os.path.join(embeddings_path, "all_peptide_embeddings.pt")
66
+
67
+ if os.path.exists(combined_path):
68
+ print(f"Loading combined embeddings from {combined_path} (FULL DATA)...")
69
+ self.embeddings = torch.load(combined_path, map_location=self.device)
70
+ print(f"✓ Loaded ALL embeddings: {self.embeddings.shape}")
71
+ else:
72
+ print("Combined embeddings file not found, loading individual files...")
73
+ # Fallback to individual files
74
+ import glob
75
+
76
+ embedding_files = glob.glob(os.path.join(embeddings_path, "*.pt"))
77
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json') and not f.endswith('all_peptide_embeddings.pt')]
78
+
79
+ print(f"Found {len(embedding_files)} individual embedding files")
80
+
81
+ # Load and stack all embeddings
82
+ embeddings_list = []
83
+ for file_path in embedding_files:
84
+ try:
85
+ embedding = torch.load(file_path)
86
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
87
+ embeddings_list.append(embedding)
88
+ else:
89
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
90
+ except Exception as e:
91
+ print(f"Warning: Could not load {file_path}: {e}")
92
+
93
+ if not embeddings_list:
94
+ raise ValueError("No valid embeddings found!")
95
+
96
+ self.embeddings = torch.stack(embeddings_list)
97
+ print(f"Loaded {len(self.embeddings)} embeddings from individual files")
98
+
99
+ # Compute normalization statistics
100
+ print("Computing preprocessing statistics...")
101
+ self._compute_preprocessing_stats()
102
+
103
+ # Broadcast statistics to all processes
104
+ if self.rank == 0:
105
+ stats_tensor = torch.stack([
106
+ self.stats['mean'], self.stats['std'],
107
+ self.stats['min'], self.stats['max']
108
+ ]).to(self.device)
109
+ else:
110
+ stats_tensor = torch.zeros(4, ESM_DIM, device=self.device)
111
+
112
+ dist.broadcast(stats_tensor, src=0)
113
+
114
+ if self.rank != 0:
115
+ self.stats = {
116
+ 'mean': stats_tensor[0],
117
+ 'std': stats_tensor[1],
118
+ 'min': stats_tensor[2],
119
+ 'max': stats_tensor[3]
120
+ }
121
+
122
+ # Initialize models
123
+ self._initialize_models()
124
+
125
+ def _compute_preprocessing_stats(self):
126
+ """Compute preprocessing statistics (only on main process)."""
127
+ # Flatten all embeddings
128
+ flat = self.embeddings.view(-1, ESM_DIM)
129
+
130
+ # 1. Z-score normalization statistics
131
+ feat_mean = flat.mean(0)
132
+ feat_std = flat.std(0) + 1e-8
133
+
134
+ # 2. Truncation statistics (after z-score)
135
+ z_score_normalized = (flat - feat_mean) / feat_std
136
+ z_score_clamped = torch.clamp(z_score_normalized, -4, 4)
137
+
138
+ # 3. Min-max normalization statistics (after truncation)
139
+ feat_min = z_score_clamped.min(0)[0]
140
+ feat_max = z_score_clamped.max(0)[0]
141
+
142
+ # Store statistics
143
+ self.stats = {
144
+ 'mean': feat_mean,
145
+ 'std': feat_std,
146
+ 'min': feat_min,
147
+ 'max': feat_max
148
+ }
149
+
150
+ # Save statistics for later use
151
+ torch.save(self.stats, 'normalization_stats.pt')
152
+ if self.rank == 0:
153
+ print("✓ Preprocessing statistics computed and saved to normalization_stats.pt")
154
+
155
+ def _initialize_models(self):
156
+ """Initialize models for distributed training."""
157
+ # Load pre-trained compressor and decompressor
158
+ self.compressor = Compressor().to(self.device)
159
+ self.decompressor = Decompressor().to(self.device)
160
+
161
+ # Load trained weights
162
+ self.compressor.load_state_dict(torch.load('final_compressor_model.pth', map_location=self.device))
163
+ self.decompressor.load_state_dict(torch.load('final_decompressor_model.pth', map_location=self.device))
164
+
165
+ # Initialize flow matching model with CFG
166
+ self.flow_model = AMPFlowMatcherCFGConcat(
167
+ hidden_dim=480,
168
+ compressed_dim=COMP_DIM,
169
+ n_layers=12,
170
+ n_heads=16,
171
+ dim_ff=3072,
172
+ max_seq_len=25,
173
+ use_cfg=True
174
+ ).to(self.device)
175
+
176
+ # Wrap with DDP
177
+ self.flow_model = DDP(self.flow_model, device_ids=[self.local_rank], find_unused_parameters=True)
178
+
179
+ if self.rank == 0:
180
+ print("✓ Initialized models for distributed training")
181
+ print(f" - Flow model parameters: {sum(p.numel() for p in self.flow_model.parameters()):,}")
182
+ print(f" - Using {self.world_size} GPUs")
183
+
184
+ def _preprocess_batch(self, batch):
185
+ """Apply preprocessing to a batch of embeddings."""
186
+ # 1. Z-score normalization
187
+ h_norm = (batch - self.stats['mean'].to(batch.device)) / self.stats['std'].to(batch.device)
188
+
189
+ # 2. Truncation (saturation) of outliers
190
+ h_trunc = torch.clamp(h_norm, min=-4.0, max=4.0)
191
+
192
+ # 3. Min-max normalization per dimension
193
+ h_min = self.stats['min'].to(batch.device)
194
+ h_max = self.stats['max'].to(batch.device)
195
+ h_scaled = (h_trunc - h_min) / (h_max - h_min + 1e-8)
196
+ h_scaled = torch.clamp(h_scaled, 0.0, 1.0)
197
+
198
+ return h_scaled
199
+
200
+ def train_flow_matching(self):
201
+ """Train the flow matching model using distributed training."""
202
+ if self.rank == 0:
203
+ print("Step 3: Training Flow Matching model (Multi-GPU)...")
204
+
205
+ # Create CFG dataset and distributed data loader
206
+ try:
207
+ # Try to use CFG dataset with real labels
208
+ dataset = CFGFlowDataset(
209
+ embeddings_path=self.embeddings_path,
210
+ cfg_data_path=self.cfg_data_path,
211
+ use_masked_labels=True,
212
+ max_seq_len=MAX_SEQ_LEN,
213
+ device=self.device
214
+ )
215
+ print("✓ Using CFG dataset with real labels")
216
+ except Exception as e:
217
+ print(f"Warning: Could not load CFG dataset: {e}")
218
+ print("Falling back to random labels (not recommended for CFG)")
219
+ # Fallback to original dataset with random labels
220
+ dataset = PrecomputedEmbeddingDataset(self.embeddings_path)
221
+
222
+ sampler = DistributedSampler(dataset, num_replicas=self.world_size, rank=self.rank)
223
+ dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, sampler=sampler, num_workers=4)
224
+
225
+ # Initialize optimizer
226
+ optimizer = optim.AdamW(
227
+ self.flow_model.parameters(),
228
+ lr=BASE_LR,
229
+ betas=(0.9, 0.98),
230
+ weight_decay=0.01,
231
+ eps=1e-6
232
+ )
233
+
234
+ # LR scheduling: warmup -> cosine
235
+ warmup_sched = LinearLR(optimizer, start_factor=1e-8, end_factor=1.0, total_iters=WARMUP_STEPS)
236
+ cosine_sched = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=LR_MIN)
237
+ scheduler = SequentialLR(optimizer, [warmup_sched, cosine_sched], milestones=[WARMUP_STEPS])
238
+
239
+ # Training loop
240
+ self.flow_model.train()
241
+ total_steps = 0
242
+
243
+ if self.rank == 0:
244
+ print(f"Starting training for {EPOCHS} iterations with FULL DATA...")
245
+ print(f"Total batch size: {BATCH_SIZE * self.world_size}")
246
+ print(f"Steps per epoch: {len(dataloader)}")
247
+ print(f"Total samples: {len(dataset):,}")
248
+ print(f"Estimated time: ~30-45 minutes (using ALL data)")
249
+
250
+ for epoch in range(EPOCHS):
251
+ sampler.set_epoch(epoch) # Ensure different shuffling per epoch
252
+
253
+ for batch_idx, batch_data in enumerate(dataloader):
254
+ # Handle different data formats
255
+ if isinstance(batch_data, dict) and 'embeddings' in batch_data:
256
+ # CFG dataset format
257
+ x = batch_data['embeddings'].to(self.device)
258
+ labels = batch_data['labels'].to(self.device)
259
+ else:
260
+ # Original dataset format - use random labels
261
+ x = batch_data.to(self.device)
262
+ labels = torch.randint(0, 3, (x.shape[0],), device=self.device)
263
+
264
+ batch_size = x.shape[0]
265
+
266
+ # Apply preprocessing
267
+ x_processed = self._preprocess_batch(x)
268
+
269
+ # Compress to latent space
270
+ with torch.no_grad():
271
+ z = self.compressor(x_processed, self.stats)
272
+
273
+ # Sample random noise
274
+ eps = torch.randn_like(z)
275
+
276
+ # Sample random time
277
+ t = torch.rand(batch_size, device=self.device)
278
+
279
+ # Interpolate between data and noise
280
+ xt = t.view(batch_size, 1, 1) * eps + (1 - t.view(batch_size, 1, 1)) * z
281
+
282
+ # Target vector field for rectified flow
283
+ ut = eps - z
284
+
285
+ # Use real labels from CFG dataset or random labels as fallback
286
+ # labels are already defined above based on dataset type
287
+
288
+ # Predict vector field with CFG
289
+ vt_pred = self.flow_model(xt, t, labels=labels)
290
+
291
+ # CFM loss
292
+ loss = ((vt_pred - ut) ** 2).mean()
293
+
294
+ # Backward pass
295
+ optimizer.zero_grad()
296
+ loss.backward()
297
+
298
+ # Gradient clipping
299
+ torch.nn.utils.clip_grad_norm_(self.flow_model.parameters(), 1.0)
300
+
301
+ optimizer.step()
302
+ scheduler.step()
303
+
304
+ total_steps += 1
305
+
306
+ # Logging (only on main process) - more frequent for short training
307
+ if self.rank == 0 and total_steps % 10 == 0:
308
+ progress = (total_steps / EPOCHS) * 100
309
+ label_dist = torch.bincount(labels, minlength=3)
310
+ print(f"Step {total_steps}/{EPOCHS} ({progress:.1f}%): Loss = {loss.item():.6f}, LR = {scheduler.get_last_lr()[0]:.2e}, Labels: AMP={label_dist[0]}, Non-AMP={label_dist[1]}, Mask={label_dist[2]}")
311
+
312
+ # Save checkpoint (only on main process) - more frequent for short training
313
+ if self.rank == 0 and total_steps % 100 == 0:
314
+ self._save_checkpoint(total_steps, loss.item())
315
+
316
+ # Validation (only on main process) - more frequent for short training
317
+ if self.rank == 0 and total_steps % 200 == 0:
318
+ self._validate()
319
+
320
+ # Save final model (only on main process)
321
+ if self.rank == 0:
322
+ self._save_checkpoint(total_steps, loss.item(), is_final=True)
323
+ print("✓ Flow matching training completed!")
324
+
325
+ def _save_checkpoint(self, step, loss, is_final=False):
326
+ """Save training checkpoint (only on main process)."""
327
+ # Get the underlying model from DDP
328
+ model_state_dict = self.flow_model.module.state_dict()
329
+
330
+ checkpoint = {
331
+ 'step': step,
332
+ 'flow_model_state_dict': model_state_dict,
333
+ 'loss': loss,
334
+ }
335
+
336
+ if is_final:
337
+ torch.save(checkpoint, 'amp_flow_model_final_full_data.pth')
338
+ print(f"✓ Final model saved: amp_flow_model_final_full_data.pth")
339
+ else:
340
+ torch.save(checkpoint, f'amp_flow_checkpoint_full_data_step_{step}.pth')
341
+ print(f"✓ Checkpoint saved: amp_flow_checkpoint_full_data_step_{step}.pth")
342
+
343
+ def _validate(self):
344
+ """Validate the model by generating a few samples."""
345
+ print("Generating validation samples...")
346
+ self.flow_model.eval()
347
+
348
+ with torch.no_grad():
349
+ # Generate a few samples
350
+ eps = torch.randn(4, 25, COMP_DIM, device=self.device)
351
+ xt = eps.clone()
352
+
353
+ # 25-step generation with CFG (using AMP label)
354
+ labels = torch.full((4,), 0, device=self.device) # 0 = AMP
355
+ for step in range(25):
356
+ t = torch.ones(4, device=self.device) * (1.0 - step/25)
357
+ vt = self.flow_model(xt, t, labels=labels)
358
+ dt = 1.0 / 25
359
+ xt = xt + vt * dt
360
+
361
+ # Decompress
362
+ decompressed = self.decompressor(xt)
363
+
364
+ # Apply reverse preprocessing
365
+ m, s, mn, mx = self.stats['mean'].to(self.device), self.stats['std'].to(self.device), self.stats['min'].to(self.device), self.stats['max'].to(self.device)
366
+ decompressed = decompressed * (mx - mn + 1e-8) + mn
367
+ decompressed = decompressed * s + m
368
+
369
+ print(f" Generated samples shape: {decompressed.shape}")
370
+ print(f" Sample stats - Mean: {decompressed.mean():.4f}, Std: {decompressed.std():.4f}")
371
+
372
+ self.flow_model.train()
373
+
374
+ def main():
375
+ """Main training function with distributed setup."""
376
+ parser = argparse.ArgumentParser()
377
+ parser.add_argument('--local_rank', type=int, default=0)
378
+ parser.add_argument('--cfg_data_path', type=str, default='/data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json',
379
+ help='Path to FULL CFG training data with real labels')
380
+ args = parser.parse_args()
381
+
382
+ # Setup distributed training
383
+ rank, world_size, local_rank = setup_distributed()
384
+
385
+ if rank == 0:
386
+ print("=== Multi-GPU AMP Flow Matching Training Pipeline with FULL DATA ===")
387
+ print("This implements the complete ProtFlow methodology for AMP generation.")
388
+ print("Training for 5,000 iterations (~30-45 minutes) using ALL available data.")
389
+ print()
390
+
391
+ # Check if required files exist
392
+ required_files = [
393
+ 'final_compressor_model.pth',
394
+ 'final_decompressor_model.pth',
395
+ '/data2/edwardsun/flow_project/peptide_embeddings/'
396
+ ]
397
+
398
+ for file in required_files:
399
+ if not os.path.exists(file):
400
+ print(f"❌ Missing required file: {file}")
401
+ print("Please ensure you have:")
402
+ print("1. Run final_sequence_encoder.py to generate embeddings")
403
+ print("2. Run compressor_with_embeddings.py to train compressor/decompressor")
404
+ return
405
+
406
+ # Check if CFG data exists
407
+ if not os.path.exists(args.cfg_data_path):
408
+ print(f"⚠️ CFG data not found: {args.cfg_data_path}")
409
+ print("Training will use random labels (not recommended for CFG)")
410
+ print("To use real labels, run uniprot_data_processor.py first")
411
+ else:
412
+ print(f"✓ CFG data found: {args.cfg_data_path}")
413
+
414
+ print("✓ All required files found!")
415
+ print()
416
+
417
+ # Initialize trainer
418
+ trainer = AMPFlowTrainerMultiGPU(
419
+ embeddings_path='/data2/edwardsun/flow_project/peptide_embeddings/',
420
+ cfg_data_path=args.cfg_data_path,
421
+ rank=rank,
422
+ world_size=world_size,
423
+ local_rank=local_rank
424
+ )
425
+
426
+ # Train flow matching model
427
+ trainer.train_flow_matching()
428
+
429
+ if rank == 0:
430
+ print("\n=== Multi-GPU Training Complete with FULL DATA ===")
431
+ print("Your AMP flow matching model trained on ALL available data!")
432
+ print("Next steps:")
433
+ print("1. Test the model: python generate_amps.py")
434
+ print("2. Compare performance with previous model")
435
+ print("3. Implement reflow for 1-step generation")
436
+ print("4. Add conditioning for toxicity (future project)")
437
+
438
+ if __name__ == "__main__":
439
+ main()
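To make the core objective in the loop above easier to see in isolation, here is the same conditional flow matching step distilled into a standalone function (a sketch for illustration, not part of the repository):

```python
import torch

def cfm_loss(flow_model, z, labels):
    """One rectified-flow / CFM step. z: clean compressed latents (B, L, D); labels: (B,)."""
    B = z.shape[0]
    eps = torch.randn_like(z)                 # noise endpoint of the path
    t = torch.rand(B, device=z.device)        # uniform time, t=0 -> data, t=1 -> noise
    tb = t.view(B, 1, 1)
    xt = tb * eps + (1.0 - tb) * z            # linear interpolation between data and noise
    ut = eps - z                              # constant target velocity along the path
    vt = flow_model(xt, t, labels=labels)     # model's predicted velocity
    return ((vt - ut) ** 2).mean()            # mean-squared flow matching loss
```

The script reads RANK, WORLD_SIZE, and LOCAL_RANK from the environment, so it is normally launched through a distributed launcher such as torchrun (e.g., `torchrun --nproc_per_node=4 amp_flow_training_multi_gpu.py`), presumably wrapped by launch_multi_gpu_training.sh.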
amp_flow_training_single_gpu_full_data.py ADDED
@@ -0,0 +1,561 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import torch.optim as optim
5
+ from torch.utils.data import DataLoader
6
+ from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
7
+ import numpy as np
8
+ from tqdm import tqdm
9
+ import json
10
+ import os
11
+ import argparse
12
+ import time
13
+ from torch.cuda.amp import autocast, GradScaler
14
+ import wandb # For logging (optional)
15
+
16
+ # Import your existing components
17
+ from compressor_with_embeddings import Compressor, Decompressor, PrecomputedEmbeddingDataset
18
+ from final_flow_model import AMPFlowMatcherCFGConcat, SinusoidalTimeEmbedding
19
+ from cfg_dataset import CFGFlowDataset, create_cfg_dataloader
20
+
21
+ # ---------------- Optimized Configuration for H100 ----------------
22
+ ESM_DIM = 1280 # ESM-2 hidden dim (esm2_t33_650M_UR50D)
23
+ COMP_RATIO = 16 # compression factor
24
+ COMP_DIM = ESM_DIM // COMP_RATIO
25
+ MAX_SEQ_LEN = 50 # Actual sequence length from final_sequence_encoder.py
26
+
27
+ # Optimized hyperparameters for H100 overnight training
28
+ BATCH_SIZE = 96 # Optimized based on profiling (fastest speed: 31 steps/s)
29
+ EPOCHS = 6000 # Adjusted for 8-10 hours with batch size 96 (31 steps/s)
30
+ BASE_LR = 4e-4 # Increased from 1e-4 (scaled with batch size)
31
+ LR_MIN = 2e-4 # Minimum learning rate for cosine schedule
32
+ WARMUP_STEPS = 5000 # 5% of total iterations for warmup
33
+ GPU_ID = 3 # Use GPU 3 (the idle one)
34
+
35
+ # Training optimizations
36
+ USE_MIXED_PRECISION = True # BF16 for H100
37
+ GRADIENT_CLIP_NORM = 1.0 # Gradient clipping for stability
38
+ WEIGHT_DECAY = 0.01 # Weight decay for regularization
39
+ VALIDATION_INTERVAL = 10000 # Validate every 10K steps
40
+ CHECKPOINT_INTERVAL = 1000 # Save checkpoint every 1000 epochs
41
+ NUM_WORKERS = 16 # Increased data loading workers
42
+
43
+ # CFG training parameters
44
+ CFG_DROPOUT_RATE = 0.15 # 15% of batches as unconditional for CFG
45
+
46
+ class AMPFlowTrainerSingleGPUFullData:
47
+ """
48
+ Optimized Single GPU training pipeline for AMP generation using ProtFlow methodology.
49
+ Uses ALL available data with H100-optimized settings for overnight training.
50
+ """
51
+
52
+ def __init__(self, embeddings_path, cfg_data_path, use_wandb=False):
53
+ self.device = torch.device(f'cuda:{GPU_ID}')
54
+ self.embeddings_path = embeddings_path
55
+ self.cfg_data_path = cfg_data_path
56
+ self.use_wandb = use_wandb
57
+
58
+ # Enable H100 optimizations
59
+ torch.backends.cuda.matmul.allow_tf32 = True
60
+ torch.backends.cudnn.allow_tf32 = True
61
+
62
+ print(f"Using GPU {GPU_ID} for optimized H100 training")
63
+ print(f"Mixed precision: {USE_MIXED_PRECISION}")
64
+ print(f"Batch size: {BATCH_SIZE}")
65
+ print(f"Target epochs: {EPOCHS}")
66
+ print(f"Learning rate: {BASE_LR} -> {LR_MIN}")
67
+
68
+ # Initialize mixed precision training
69
+ if USE_MIXED_PRECISION:
70
+ self.scaler = GradScaler()
71
+ print("✓ Mixed precision training enabled (BF16)")
72
+
73
+ # Initialize wandb if requested
74
+ if self.use_wandb:
75
+ wandb.init(
76
+ project="amp-flow-training",
77
+ config={
78
+ "batch_size": BATCH_SIZE,
79
+ "epochs": EPOCHS,
80
+ "base_lr": BASE_LR,
81
+ "lr_min": LR_MIN,
82
+ "warmup_steps": WARMUP_STEPS,
83
+ "mixed_precision": USE_MIXED_PRECISION,
84
+ "gradient_clip": GRADIENT_CLIP_NORM,
85
+ "weight_decay": WEIGHT_DECAY
86
+ }
87
+ )
88
+
89
+ print(f"Loading ALL AMP embeddings from {embeddings_path}...")
90
+
91
+ # Load ALL embeddings (use the combined file instead of individual files)
92
+ self._load_all_embeddings()
93
+
94
+ # Compute normalization statistics
95
+ print("Computing preprocessing statistics...")
96
+ self._compute_preprocessing_stats()
97
+
98
+ # Initialize models
99
+ self._initialize_models()
100
+
101
+ # Initialize datasets and dataloaders
102
+ self._initialize_data()
103
+
104
+ # Initialize optimizer and scheduler
105
+ self._initialize_optimizer()
106
+
107
+ print("✓ Optimized Single GPU training setup complete with FULL DATA!")
108
+
109
+ def _load_all_embeddings(self):
110
+ """Load ALL peptide embeddings from the combined file."""
111
+ # Try to load the combined embeddings file first
112
+ combined_path = os.path.join(self.embeddings_path, "all_peptide_embeddings.pt")
113
+
114
+ if os.path.exists(combined_path):
115
+ print(f"Loading combined embeddings from {combined_path}...")
116
+ self.embeddings = torch.load(combined_path, map_location=self.device)
117
+ print(f"✓ Loaded ALL embeddings: {self.embeddings.shape}")
118
+ else:
119
+ print("Combined embeddings file not found, loading individual files...")
120
+ # Fallback to individual files
121
+ import glob
122
+
123
+ embedding_files = glob.glob(os.path.join(self.embeddings_path, "*.pt"))
124
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json') and not f.endswith('all_peptide_embeddings.pt')]
125
+
126
+ print(f"Found {len(embedding_files)} individual embedding files")
127
+
128
+ # Load and stack all embeddings
129
+ embeddings_list = []
130
+ for file_path in embedding_files:
131
+ try:
132
+ embedding = torch.load(file_path)
133
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
134
+ embeddings_list.append(embedding)
135
+ else:
136
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
137
+ except Exception as e:
138
+ print(f"Warning: Could not load {file_path}: {e}")
139
+
140
+ if not embeddings_list:
141
+ raise ValueError("No valid embeddings found!")
142
+
143
+ self.embeddings = torch.stack(embeddings_list)
144
+ print(f"Loaded {len(self.embeddings)} embeddings from individual files")
145
+
146
+ def _compute_preprocessing_stats(self):
147
+ """Compute normalization statistics for embeddings."""
148
+ # Flatten all embeddings
149
+ flat_embeddings = self.embeddings.reshape(-1, ESM_DIM)
150
+
151
+ # Compute statistics
152
+ mean = flat_embeddings.mean(dim=0)
153
+ std = flat_embeddings.std(dim=0)
154
+ min_val = flat_embeddings.min()
155
+ max_val = flat_embeddings.max()
156
+
157
+ self.stats = {
158
+ 'mean': mean,
159
+ 'std': std,
160
+ 'min': min_val,
161
+ 'max': max_val
162
+ }
163
+
164
+ # Save statistics
165
+ torch.save(self.stats, 'normalization_stats.pt')
166
+ print(f"✓ Statistics computed and saved:")
167
+ print(f" Total embeddings: {len(self.embeddings):,}")
168
+ print(f" Mean: {mean.mean():.4f} ± {mean.std():.4f}")
169
+ print(f" Std: {std.mean():.4f} ± {std.std():.4f}")
170
+ print(f" Range: [{min_val:.4f}, {max_val:.4f}]")
171
+
172
+ def _initialize_models(self):
173
+ """Initialize compressor, decompressor, and flow model."""
174
+ print("Initializing models...")
175
+
176
+ # Load pre-trained compressor and decompressor
177
+ self.compressor = Compressor().to(self.device)
178
+ self.decompressor = Decompressor().to(self.device)
179
+
180
+ self.compressor.load_state_dict(torch.load('final_compressor_model.pth', map_location=self.device))
181
+ self.decompressor.load_state_dict(torch.load('final_decompressor_model.pth', map_location=self.device))
182
+
183
+ # Initialize flow model with CFG
184
+ self.flow_model = AMPFlowMatcherCFGConcat(
185
+ hidden_dim=480,
186
+ compressed_dim=COMP_DIM,
187
+ n_layers=12,
188
+ n_heads=16,
189
+ dim_ff=3072,
190
+ max_seq_len=25, # MAX_SEQ_LEN // 2 due to pooling
191
+ use_cfg=True
192
+ ).to(self.device)
193
+
194
+ # Compile model for PyTorch 2.x speedup (if available)
195
+ try:
196
+ self.flow_model = torch.compile(self.flow_model, mode="reduce-overhead")
197
+ print("✓ Model compiled with torch.compile for speedup")
198
+ except Exception as e:
199
+ print(f"⚠️ Model compilation failed: {e}")
200
+
201
+ # Set models to training mode
202
+ self.compressor.train()
203
+ self.decompressor.train()
204
+ self.flow_model.train()
205
+
206
+ print(f"✓ Models initialized:")
207
+ print(f" Compressor parameters: {sum(p.numel() for p in self.compressor.parameters()):,}")
208
+ print(f" Decompressor parameters: {sum(p.numel() for p in self.decompressor.parameters()):,}")
209
+ print(f" Flow model parameters: {sum(p.numel() for p in self.flow_model.parameters()):,}")
210
+
211
+ def _initialize_data(self):
212
+ """Initialize datasets and dataloaders with FULL data."""
213
+ print("Initializing datasets with FULL data...")
214
+
215
+ # Create CFG dataset with FULL UniProt data
216
+ self.cfg_dataset = CFGFlowDataset(
217
+ embeddings_path=self.embeddings_path,
218
+ cfg_data_path=self.cfg_data_path,
219
+ use_masked_labels=True,
220
+ max_seq_len=MAX_SEQ_LEN,
221
+ device=self.device
222
+ )
223
+
224
+ # Create dataloader with optimized settings
225
+ self.dataloader = create_cfg_dataloader(
226
+ self.cfg_dataset,
227
+ batch_size=BATCH_SIZE,
228
+ shuffle=True,
229
+ num_workers=NUM_WORKERS
230
+ )
231
+
232
+ # Calculate total steps and validation intervals
233
+ self.total_steps = len(self.dataloader) * EPOCHS
234
+ self.validation_steps = VALIDATION_INTERVAL
235
+
236
+ print(f"✓ Dataset initialized with FULL data:")
237
+ print(f" Total samples: {len(self.cfg_dataset):,}")
238
+ print(f" Batch size: {BATCH_SIZE}")
239
+ print(f" Batches per epoch: {len(self.dataloader):,}")
240
+ print(f" Total training steps: {self.total_steps:,}")
241
+ print(f" Validation every: {self.validation_steps:,} steps")
242
+
243
+ def _initialize_optimizer(self):
244
+ """Initialize optimizer and learning rate scheduler."""
245
+ print("Initializing optimizer and scheduler...")
246
+
247
+ # Optimizer for flow model only (compressor/decompressor are frozen)
248
+ self.optimizer = optim.AdamW(
249
+ self.flow_model.parameters(),
250
+ lr=BASE_LR,
251
+ weight_decay=WEIGHT_DECAY,
252
+ betas=(0.9, 0.98), # Optimized betas for flow matching
253
+ eps=1e-6 # Lower epsilon for numerical stability
254
+ )
255
+
256
+ # Learning rate scheduler with proper warmup and cosine annealing
257
+ warmup_scheduler = LinearLR(
258
+ self.optimizer,
259
+ start_factor=0.1,
260
+ end_factor=1.0,
261
+ total_iters=WARMUP_STEPS
262
+ )
263
+
264
+ main_scheduler = CosineAnnealingLR(
265
+ self.optimizer,
266
+ T_max=self.total_steps - WARMUP_STEPS,
267
+ eta_min=LR_MIN
268
+ )
269
+
270
+ self.scheduler = SequentialLR(
271
+ self.optimizer,
272
+ schedulers=[warmup_scheduler, main_scheduler],
273
+ milestones=[WARMUP_STEPS]
274
+ )
275
+
276
+ print(f"✓ Optimizer initialized:")
277
+ print(f" Base LR: {BASE_LR}")
278
+ print(f" Min LR: {LR_MIN}")
279
+ print(f" Warmup steps: {WARMUP_STEPS}")
280
+ print(f" Weight decay: {WEIGHT_DECAY}")
281
+ print(f" Gradient clip norm: {GRADIENT_CLIP_NORM}")
282
+
283
+ def _preprocess_batch(self, batch):
284
+ """Preprocess a batch of data for training."""
285
+ # Extract data
286
+ embeddings = batch['embeddings'].to(self.device) # (B, L, ESM_DIM)
287
+ labels = batch['labels'].to(self.device) # (B,)
288
+
289
+ # Normalize embeddings
290
+ m, s = self.stats['mean'].to(self.device), self.stats['std'].to(self.device)
291
+ mn, mx = self.stats['min'].to(self.device), self.stats['max'].to(self.device)
292
+
293
+ embeddings = (embeddings - m) / (s + 1e-8)
294
+ embeddings = (embeddings - mn) / (mx - mn + 1e-8)
295
+
296
+ # Compress embeddings
297
+ with torch.no_grad():
298
+ compressed = self.compressor(embeddings) # (B, L, COMP_DIM)
299
+
300
+ return compressed, labels
301
+
302
+ def _compute_validation_metrics(self):
303
+ """Compute validation metrics on a subset of data."""
304
+ self.flow_model.eval()
305
+ val_losses = []
306
+
307
+ # Use a subset of data for validation
308
+ val_samples = min(1000, len(self.cfg_dataset))
309
+ val_indices = torch.randperm(len(self.cfg_dataset))[:val_samples]
310
+
311
+ with torch.no_grad():
312
+ for i in range(0, val_samples, BATCH_SIZE):
313
+ batch_indices = val_indices[i:i+BATCH_SIZE]
314
+ batch_data = [self.cfg_dataset[idx] for idx in batch_indices]
315
+
316
+ # Collate batch
317
+ embeddings = torch.stack([item['embedding'] for item in batch_data])
318
+ labels = torch.stack([item['label'] for item in batch_data])
319
+
320
+ # Preprocess
321
+ compressed, labels = self._preprocess_batch({
322
+ 'embeddings': embeddings,
323
+ 'labels': labels
324
+ })
325
+
326
+ B, L, D = compressed.shape
327
+
328
+ # Sample random time
329
+ t = torch.rand(B, device=self.device)
330
+
331
+ # Sample random noise
332
+ eps = torch.randn_like(compressed)
333
+
334
+ # Compute target
335
+ xt = (1 - t.unsqueeze(-1).unsqueeze(-1)) * compressed + t.unsqueeze(-1).unsqueeze(-1) * eps
336
+
337
+ # Predict vector field
338
+ vt_pred = self.flow_model(xt, t, labels=labels)
339
+
340
+ # Target vector field
341
+ vt_target = eps - compressed
342
+
343
+ # Compute loss
344
+ loss = F.mse_loss(vt_pred, vt_target)
345
+ val_losses.append(loss.item())
346
+
347
+ self.flow_model.train()
348
+ return np.mean(val_losses)
349
+
350
+ def train_flow_matching(self):
351
+ """Train the flow matching model with FULL data and optimizations."""
352
+ print(f"🚀 Starting Optimized Single GPU Flow Matching Training with FULL DATA")
353
+ print(f"GPU: {GPU_ID}")
354
+ print(f"Total iterations: {EPOCHS}")
355
+ print(f"Batch size: {BATCH_SIZE}")
356
+ print(f"Total samples: {len(self.cfg_dataset):,}")
357
+ print(f"Mixed precision: {USE_MIXED_PRECISION}")
358
+ print(f"Estimated time: ~8-10 hours (overnight training with ALL data)")
359
+ print("=" * 60)
360
+
361
+ # Training loop
362
+ best_loss = float('inf')
363
+ losses = []
364
+ val_losses = []
365
+ global_step = 0
366
+ start_time = time.time()
367
+
368
+ for epoch in tqdm(range(EPOCHS), desc="Training Flow Model"):
369
+ epoch_losses = []
370
+ epoch_start_time = time.time()
371
+
372
+ for batch_idx, batch in enumerate(self.dataloader):
373
+ # Preprocess batch
374
+ compressed, labels = self._preprocess_batch(batch)
375
+ B, L, D = compressed.shape
376
+
377
+ # CFG training: randomly mask some labels for unconditional training
378
+ if torch.rand(1).item() < CFG_DROPOUT_RATE:
379
+ labels = torch.full_like(labels, fill_value=-1) # Unconditional
380
+
381
+ # Sample random time
382
+ t = torch.rand(B, device=self.device) # (B,)
383
+
384
+ # Sample random noise
385
+ eps = torch.randn_like(compressed) # (B, L, D)
386
+
387
+ # Compute target: x_t = (1-t) * x_0 + t * eps
388
+ xt = (1 - t.unsqueeze(-1).unsqueeze(-1)) * compressed + t.unsqueeze(-1).unsqueeze(-1) * eps
389
+
390
+ # Forward pass with mixed precision
391
+ if USE_MIXED_PRECISION:
392
+ with autocast(dtype=torch.bfloat16):
393
+ vt_pred = self.flow_model(xt, t, labels=labels) # (B, L, D)
394
+ vt_target = eps - compressed # (B, L, D)
395
+ loss = F.mse_loss(vt_pred, vt_target)
396
+
397
+ # Backward pass with gradient scaling
398
+ self.optimizer.zero_grad()
399
+ self.scaler.scale(loss).backward()
400
+
401
+ # Gradient clipping
402
+ self.scaler.unscale_(self.optimizer)
403
+ torch.nn.utils.clip_grad_norm_(self.flow_model.parameters(), max_norm=GRADIENT_CLIP_NORM)
404
+
405
+ self.scaler.step(self.optimizer)
406
+ self.scaler.update()
407
+ else:
408
+ # Standard training
409
+ vt_pred = self.flow_model(xt, t, labels=labels) # (B, L, D)
410
+ vt_target = eps - compressed # (B, L, D)
411
+ loss = F.mse_loss(vt_pred, vt_target)
412
+
413
+ # Backward pass
414
+ self.optimizer.zero_grad()
415
+ loss.backward()
416
+
417
+ # Gradient clipping
418
+ torch.nn.utils.clip_grad_norm_(self.flow_model.parameters(), max_norm=GRADIENT_CLIP_NORM)
419
+
420
+ self.optimizer.step()
421
+
422
+ # Update learning rate
423
+ self.scheduler.step()
424
+
425
+ epoch_losses.append(loss.item())
426
+ global_step += 1
427
+
428
+ # Logging
429
+ if batch_idx % 100 == 0:
430
+ current_lr = self.scheduler.get_last_lr()[0]
431
+ elapsed_time = time.time() - start_time
432
+ steps_per_sec = global_step / elapsed_time
433
+ eta_hours = (self.total_steps - global_step) / steps_per_sec / 3600
434
+
435
+ print(f"Epoch {epoch:4d} | Step {global_step:6d}/{self.total_steps:6d} | "
436
+ f"Loss: {loss.item():.6f} | LR: {current_lr:.2e} | "
437
+ f"Speed: {steps_per_sec:.1f} steps/s | ETA: {eta_hours:.1f}h")
438
+
439
+ # Log to wandb
440
+ if self.use_wandb:
441
+ wandb.log({
442
+ 'train/loss': loss.item(),
443
+ 'train/learning_rate': current_lr,
444
+ 'train/steps_per_sec': steps_per_sec,
445
+ 'train/global_step': global_step
446
+ })
447
+
448
+ # Validation
449
+ if global_step % self.validation_steps == 0:
450
+ val_loss = self._compute_validation_metrics()
451
+ val_losses.append(val_loss)
452
+
453
+ print(f"Validation at step {global_step}: Loss = {val_loss:.6f}")
454
+
455
+ if self.use_wandb:
456
+ wandb.log({
457
+ 'val/loss': val_loss,
458
+ 'val/global_step': global_step
459
+ })
460
+
461
+ # Early stopping check
462
+ if val_loss < best_loss:
463
+ best_loss = val_loss
464
+ self._save_checkpoint(epoch, val_loss, global_step, is_final=False, is_best=True)
465
+
466
+ # Compute epoch statistics
467
+ avg_loss = np.mean(epoch_losses)
468
+ losses.append(avg_loss)
469
+ epoch_time = time.time() - epoch_start_time
470
+
471
+ print(f"Epoch {epoch:4d} | Avg Loss: {avg_loss:.6f} | "
472
+ f"LR: {self.scheduler.get_last_lr()[0]:.2e} | "
473
+ f"Time: {epoch_time:.1f}s | Samples: {len(self.cfg_dataset):,}")
474
+
475
+ # Save checkpoint
476
+ if (epoch + 1) % CHECKPOINT_INTERVAL == 0:
477
+ self._save_checkpoint(epoch, avg_loss, global_step, is_final=True)
478
+
479
+ # Save final model
480
+ self._save_checkpoint(EPOCHS - 1, losses[-1], global_step, is_final=True)
481
+
482
+ total_time = time.time() - start_time
483
+ print("=" * 60)
484
+ print("🎉 Optimized Training Complete with FULL DATA!")
485
+ print(f"Best validation loss: {best_loss:.6f}")
486
+ print(f"Total training time: {total_time/3600:.1f} hours")
487
+ print(f"Total samples used: {len(self.cfg_dataset):,}")
488
+ print(f"Final model saved as: amp_flow_model_final_optimized.pth")
489
+
490
+ return losses, val_losses
491
+
492
+ def _save_checkpoint(self, step, loss, global_step, is_final=False, is_best=False):
493
+ """Save model checkpoint."""
494
+ # Create output directory if it doesn't exist
495
+ output_dir = '/data2/edwardsun/flow_checkpoints'
496
+ os.makedirs(output_dir, exist_ok=True)
497
+
498
+ if is_best:
499
+ filename = os.path.join(output_dir, 'amp_flow_model_best_optimized.pth')
500
+ elif is_final:
501
+ filename = os.path.join(output_dir, 'amp_flow_model_final_optimized.pth')
502
+ else:
503
+ filename = os.path.join(output_dir, f'amp_flow_checkpoint_optimized_step_{step:04d}.pth')
504
+
505
+ checkpoint = {
506
+ 'step': step,
507
+ 'global_step': global_step,
508
+ 'loss': loss,
509
+ 'flow_model_state_dict': self.flow_model.state_dict(),
510
+ 'optimizer_state_dict': self.optimizer.state_dict(),
511
+ 'scheduler_state_dict': self.scheduler.state_dict(),
512
+ 'stats': self.stats,
513
+ 'total_samples': len(self.cfg_dataset),
514
+ 'config': {
515
+ 'batch_size': BATCH_SIZE,
516
+ 'epochs': EPOCHS,
517
+ 'base_lr': BASE_LR,
518
+ 'lr_min': LR_MIN,
519
+ 'warmup_steps': WARMUP_STEPS,
520
+ 'mixed_precision': USE_MIXED_PRECISION,
521
+ 'gradient_clip': GRADIENT_CLIP_NORM,
522
+ 'weight_decay': WEIGHT_DECAY
523
+ }
524
+ }
525
+
526
+ torch.save(checkpoint, filename)
527
+ print(f"✓ Checkpoint saved: {filename} (loss: {loss:.6f}, step: {global_step})")
528
+
529
+ def main():
530
+ """Main training function."""
531
+ global BATCH_SIZE, EPOCHS
532
+
533
+ parser = argparse.ArgumentParser(description='Optimized Single GPU AMP Flow Training with FULL DATA')
534
+ parser.add_argument('--embeddings', default='/data2/edwardsun/flow_project/peptide_embeddings/',
535
+ help='Path to peptide embeddings directory')
536
+ parser.add_argument('--cfg_data', default='/data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json',
537
+ help='Path to FULL CFG data file')
538
+ parser.add_argument('--use_wandb', action='store_true', help='Use wandb for logging')
539
+ parser.add_argument('--batch_size', type=int, default=BATCH_SIZE, help='Batch size for training')
540
+ parser.add_argument('--epochs', type=int, default=EPOCHS, help='Number of training epochs')
541
+
542
+ args = parser.parse_args()
543
+
544
+ # Update global variables if provided
545
+ if args.batch_size != BATCH_SIZE:
546
+ BATCH_SIZE = args.batch_size
547
+ if args.epochs != EPOCHS:
548
+ EPOCHS = args.epochs
549
+
550
+ print(f"Starting optimized training with batch_size={BATCH_SIZE}, epochs={EPOCHS}")
551
+
552
+ # Initialize trainer
553
+ trainer = AMPFlowTrainerSingleGPUFullData(args.embeddings, args.cfg_data, args.use_wandb)
554
+
555
+ # Start training
556
+ losses, val_losses = trainer.train_flow_matching()
557
+
558
+ print("Optimized training completed successfully with FULL DATA!")
559
+
560
+ if __name__ == "__main__":
561
+ main()
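For reference, the normalization pipeline behind normalization_stats.pt (z-score normalization, clamping of outliers to ±4, then per-dimension min-max scaling into [0, 1]) can be summarized as below. This is a distilled sketch of the logic in the multi-GPU script's _compute_preprocessing_stats/_preprocess_batch (this single-GPU script applies a slightly simplified variant), not a drop-in replacement.

```python
import torch

def fit_normalization_stats(embeddings, esm_dim=1280):
    """embeddings: (N, L, esm_dim) stacked ESM-2 embeddings."""
    flat = embeddings.view(-1, esm_dim)
    mean = flat.mean(0)
    std = flat.std(0) + 1e-8
    clamped = torch.clamp((flat - mean) / std, -4.0, 4.0)   # saturate outliers
    return {"mean": mean, "std": std,
            "min": clamped.min(0)[0], "max": clamped.max(0)[0]}

def apply_normalization(x, stats):
    """x: (B, L, esm_dim) raw embeddings -> values scaled into [0, 1]."""
    h = torch.clamp((x - stats["mean"]) / stats["std"], -4.0, 4.0)
    h = (h - stats["min"]) / (stats["max"] - stats["min"] + 1e-8)
    return torch.clamp(h, 0.0, 1.0)
```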
apex/AMP_DL_model_twohead.py ADDED
@@ -0,0 +1,113 @@
1
+ import numpy as np
2
+ import torch
3
+ import torch.nn as nn
4
+ import torch.nn.functional as F
5
+ import math, copy, time
6
+ from torch.autograd import Variable
7
+
8
+ class PeptideEmbeddings(nn.Module):
9
+ def __init__(self, emb):
10
+ super().__init__()
11
+ self.aa_embedding = nn.Embedding.from_pretrained(torch.FloatTensor(emb), padding_idx=0)
12
+ def forward(self, x):
13
+ out = self.aa_embedding(x)
14
+ return out
15
+
16
+ class AMP_model(nn.Module):
17
+ def __init__(self, emb, emb_size, num_rnn_layers, dim_h, dim_latent, num_fc_layers, num_task):
18
+ super().__init__()
19
+
20
+ self.peptideEmb = PeptideEmbeddings(emb=emb)
21
+ self.dim_emb = emb_size
22
+ self.dim_h = dim_h
23
+ self.dropout = 0.1
24
+ self.dim_latent = dim_latent
25
+ max_len = 52
26
+
27
+ self.rnn = nn.GRU(emb_size, dim_h, num_layers=num_rnn_layers, batch_first=True, dropout=0.1, bidirectional=True)
28
+ self.layernorm = nn.LayerNorm(dim_h * 2)
29
+ self.attn1 = nn.Linear(dim_h * 2 + emb_size, max_len)
30
+ self.attn2 = nn.Linear(dim_h * 2, 1)
31
+
32
+ self.fc0 = nn.Linear(dim_h * 2, dim_h)
33
+
34
+ self.fc1 = nn.Linear(dim_h, dim_latent)
35
+ self.fc2 = nn.Linear(dim_latent, int(dim_latent / 2))
36
+ self.fc3 = nn.Linear(int(dim_latent / 2), int(dim_latent / 4))
37
+ self.fc4 = nn.Linear(int(dim_latent / 4), num_task)
38
+
39
+ self.ln1 = nn.LayerNorm(dim_latent)
40
+ self.ln2 = nn.LayerNorm(int(dim_latent / 2))
41
+ self.ln3 = nn.LayerNorm(int(dim_latent / 4))
42
+
43
+ self.dp1 = nn.Dropout(0.1)#nn.Dropout(0.2)
44
+ self.dp2 = nn.Dropout(0.1)#nn.Dropout(0.2)
45
+ self.dp3 = nn.Dropout(0.1)#nn.Dropout(0.2)
46
+
47
+
48
+
49
+ self.fc1_ = nn.Linear(dim_h, dim_latent)
50
+ self.fc2_ = nn.Linear(dim_latent, int(dim_latent / 2))
51
+ self.fc3_ = nn.Linear(int(dim_latent / 2), int(dim_latent / 4))
52
+ self.fc4_ = nn.Linear(int(dim_latent / 4), 1)
53
+
54
+ self.ln1_ = nn.LayerNorm(dim_latent)
55
+ self.ln2_ = nn.LayerNorm(int(dim_latent / 2))
56
+ self.ln3_ = nn.LayerNorm(int(dim_latent / 4))
57
+
58
+ self.dp1_ = nn.Dropout(0.1)#nn.Dropout(0.2)
59
+ self.dp2_ = nn.Dropout(0.1)#nn.Dropout(0.2)
60
+ self.dp3_ = nn.Dropout(0.1)#nn.Dropout(0.2)
61
+
62
+
63
+
64
+
65
+ def forward(self, x):
66
+
67
+ x = self.peptideEmb(x)
68
+ #h = self.initH(x.shape[0])
69
+ #out, h = self.rnn(x, h)
70
+ out, h = self.rnn(x)
71
+ out = self.layernorm(out)
72
+
73
+ attn_weights1 = F.softmax(self.attn1(torch.cat((out, x), 2)), dim=2) #to be tested: masked softmax
74
+ attn_weights1.permute(0, 2, 1)
75
+ out = torch.bmm(attn_weights1, out)
76
+ attn_weights2 = F.softmax(self.attn2(out), dim=1) #to be tested: masked softmax
77
+ out = torch.sum(attn_weights2 * out, dim=1) #to be test: masked sum
78
+
79
+ out = self.fc0(out)
80
+
81
+ out = self.dp1(F.relu(self.ln1(self.fc1(out))))
82
+ out = self.dp2(F.relu(self.ln2(self.fc2(out))))
83
+ out = self.dp3(F.relu(self.ln3(self.fc3(out))))
84
+ out = self.fc4(out)
85
+
86
+ return F.relu(out)
87
+
88
+ def predict(self, x):
89
+ return self.forward(x)
90
+
91
+
92
+ def cls_forward(self, x):
93
+
94
+ x = self.peptideEmb(x)
95
+ #h = self.initH(x.shape[0])
96
+ #out, h = self.rnn(x, h)
97
+ out, h = self.rnn(x)
98
+ out = self.layernorm(out)
99
+
100
+ attn_weights1 = F.softmax(self.attn1(torch.cat((out, x), 2)), dim=2) #to be tested: masked softmax
101
+ attn_weights1.permute(0, 2, 1)
102
+ out = torch.bmm(attn_weights1, out)
103
+ attn_weights2 = F.softmax(self.attn2(out), dim=1) #to be tested: masked softmax
104
+ out = torch.sum(attn_weights2 * out, dim=1) #to be test: masked sum
105
+
106
+ out = self.fc0(out)
107
+
108
+ out = self.dp1_(F.relu(self.ln1_(self.fc1_(out))))
109
+ out = self.dp2_(F.relu(self.ln2_(self.fc2_(out))))
110
+ out = self.dp3_(F.relu(self.ln3_(self.fc3_(out))))
111
+ out = self.fc4_(out)
112
+
113
+ return out
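For orientation, here is a hypothetical instantiation of the two-head model above. The embedding matrix and every hyperparameter value are placeholders chosen only so that the shapes line up (inputs must be padded or truncated to the hard-coded max_len of 52); the real values are set by apex/predict.py.

```python
# Assumes this is run from the apex/ directory.
import numpy as np
import torch

from AMP_DL_model_twohead import AMP_model

vocab_size, emb_size = 21, 100                  # placeholder: 20 amino acids + padding index 0
emb = np.random.randn(vocab_size, emb_size).astype(np.float32)

model = AMP_model(emb=emb, emb_size=emb_size, num_rnn_layers=2,
                  dim_h=256, dim_latent=128, num_fc_layers=3, num_task=34)

tokens = torch.randint(1, vocab_size, (4, 52))  # integer-encoded peptides padded to length 52
mic_out = model(tokens)                         # regression head: (4, num_task) predicted MICs
cls_out = model.cls_forward(tokens)             # classification head: (4, 1) AMP logit
print(mic_out.shape, cls_out.shape)
```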
apex/Predicted_MICs.csv ADDED
@@ -0,0 +1,11 @@
1
+ ,E. coli ATCC11775,P. aeruginosa PAO1,P. aeruginosa PA14,S. aureus ATCC12600,E. coli AIG221,E. coli AIG222,K. pneumoniae ATCC13883,A. baumannii ATCC19606,A. muciniphila ATCC BAA-835,B. fragilis ATCC25285,B. vulgatus ATCC8482,C. aerofaciens ATCC25986,C. scindens ATCC35704,B. thetaiotaomicron ATCC29148,B. thetaiotaomicron Complemmented,B. thetaiotaomicron Mutant,B. uniformis ATCC8492,B. eggerthi ATCC27754,C. spiroforme ATCC29900,P. distasonis ATCC8503,P. copri DSMZ18205,B. ovatus ATCC8483,E. rectale ATCC33656,C. symbiosum,R. obeum,R. torques,S. aureus (ATCC BAA-1556) - MRSA,vancomycin-resistant E. faecalis ATCC700802,vancomycin-resistant E. faecium ATCC700221,E. coli Nissle,Salmonella enterica ATCC 9150 (BEIRES NR-515),Salmonella enterica (BEIRES NR-170),Salmonella enterica ATCC 9150 (BEIRES NR-174),L. monocytogenes ATCC 19111 (BEIRES NR-106)
2
+ IPKTYDKRWDDQCWLAITGRYHGITTPPCCSWVV,134.37933,133.67014,133.47633,132.06357,139.48221,136.32346,137.199,126.84656,126.05934,137.75461,138.1181,142.69162,137.7662,139.89436,133.77281,134.47473,144.46127,133.90617,136.54572,138.02174,131.41934,133.84996,131.85303,302.2381,412.25684,296.79123,137.54033,135.57431,135.85416,2526.2712,100.91771,1219.0903,676.08905,124.86306
3
+ KWLIYYNEGHLMVKYMLTISVRIPEGDNPNIQLHGSIGSR,113.27322,113.816246,105.8549,121.385605,118.87728,117.084915,121.39231,97.408005,114.83742,126.84959,121.16374,117.40218,118.76237,127.41105,124.46525,122.88039,120.520775,116.183304,128.06148,109.68715,118.8102,127.2724,118.91581,295.51852,416.33197,292.758,128.81776,132.28825,108.17039,2471.853,80.87459,1018.7287,640.92346,103.58418
4
+ VGHAQVASPDLHWDGHGNHLIPWTPCYSHEMNPTMPPA,139.44724,136.14822,134.20648,136.02388,141.02017,141.60233,139.74214,136.73692,135.3081,139.60617,140.51639,138.64197,137.41493,141.17766,137.0292,136.66333,143.8462,136.84438,136.74908,135.07846,133.97592,136.8179,137.57657,308.34232,440.25397,311.01697,140.81741,141.47835,136.60622,2779.5024,107.65519,1268.6604,722.2069,134.72772
5
+ RIWETQGSDCIRDGIDSTGPPFMVMFHAAGWRQVHSK,127.36061,130.93207,129.1912,133.74936,130.3024,128.63132,132.07448,114.99402,118.79469,135.40488,129.4568,131.13911,129.87733,135.09549,132.68257,132.721,141.9473,132.37192,133.2458,127.67586,126.89638,133.11191,132.14206,276.87024,390.039,291.6936,138.57545,139.01753,126.76943,2116.5493,90.535255,935.9032,628.7894,112.784706
6
+ IYEDYEFVRMPTHMTDFMQSPDQQNPKHMWTLCFDHT,138.19647,136.43967,134.19327,136.39677,140.33774,140.78696,138.82767,134.45322,135.36095,137.68173,142.58617,141.29187,142.04684,139.79361,135.85521,134.95773,140.02855,135.06549,135.18292,135.65457,126.83917,135.78409,134.22078,295.92462,423.1288,300.80344,139.7175,138.30704,130.73404,2520.869,103.63745,1122.4617,687.5478,133.08084
7
+ CPWVQHFWAPPWAHCICIEGPEESGWATIEPMVVGT,137.11711,136.29323,134.96661,138.4914,142.4903,139.87944,138.16125,131.98262,139.51706,141.34677,142.34552,143.17244,142.15126,143.51427,137.9091,138.8798,148.41785,137.80379,137.64511,136.5797,134.00578,137.73303,140.58127,308.2508,410.4842,304.77887,142.90944,142.9198,144.48164,2608.0327,113.228615,1234.4288,722.4048,132.68794
8
+ FPLTMHGEFSQNLVWTITQHLVKRWCYTLSPKFCHRY,132.82092,130.18073,128.30576,137.0156,136.58134,136.30899,136.51419,119.462135,117.975525,137.00491,138.4794,137.34769,143.57109,138.3506,133.10274,132.72733,138.9192,130.98361,137.97784,130.82622,128.28854,134.46898,138.15057,305.70197,424.16177,313.93192,140.46169,136.51822,135.93553,2973.7498,98.21229,1215.7686,712.4238,124.42888
9
+ SRSEDQILATYWRTSTCYFNQLWFQRLTGQQRICC,132.35309,135.08835,133.4797,133.83403,133.55246,133.56703,133.06903,127.04684,118.50051,137.62326,134.35898,139.048,139.2864,136.87164,133.12827,133.68544,138.84805,135.44177,134.43094,132.8223,130.07999,134.2626,134.12837,265.18292,371.36026,286.07922,137.51228,138.9885,130.92508,2142.12,94.96339,856.95593,609.7506,119.94425
10
+ QLELPCCIETWKLNVAFRCPFHKDLKRLGLYSRDKW,96.86034,112.103455,106.62036,130.5871,97.71516,91.49239,112.85535,70.78958,77.14095,129.05354,115.55772,111.96426,105.05908,126.89999,123.53723,117.43738,116.70744,106.60911,124.693054,112.9424,114.001854,123.950424,114.90995,303.3677,364.20624,275.29742,134.68173,130.36494,114.740585,2476.3755,62.907795,913.19275,507.56738,83.11806
11
+ PPMDCVYAIKTTSDHQSTMFIIPRYTHMYGNLQLWCVYCT,135.86214,137.20145,135.70978,134.45326,140.82416,139.36844,138.8737,130.17923,136.19333,138.40355,136.87325,143.45218,132.80997,140.24698,135.89694,135.31473,144.77373,138.8391,138.01976,137.25049,130.33195,135.40433,133.84612,288.31738,409.42154,300.98975,139.98962,136.1305,135.23387,2705.215,105.53652,1263.1531,691.8714,129.66193
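Predicted_MICs.csv stores one row per peptide (first, unnamed column) and one predicted MIC column per bacterial strain, as in the header above. A minimal sketch of how one might load and summarize it, assuming pandas is installed and the file sits at apex/Predicted_MICs.csv:

```python
# Sketch: load the APEX MIC predictions and summarize them per peptide.
# Assumes pandas is available and the path below matches this repository layout.
import pandas as pd

# The first (unnamed) column holds the peptide sequence; the rest are per-strain MICs.
mics = pd.read_csv("apex/Predicted_MICs.csv", index_col=0)

# For each peptide, report the lowest predicted MIC and the strain it corresponds to.
summary = pd.DataFrame({
    "best_mic": mics.min(axis=1),
    "best_strain": mics.idxmin(axis=1),
})
print(summary.sort_values("best_mic").head())
```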
apex/README.md ADDED
@@ -0,0 +1,24 @@
1
+ # APEX - Molecular de-extinction of antibiotics enabled by deep learning
2
+
3
+ ## Predict AMPs using APEX
4
+ Running predict.py generates species-specific antimicrobial activities (MICs) for the peptides in test_seqs.txt and saves them to Predicted_MICs.csv. To predict antimicrobial activities for novel peptides, replace the peptides in test_seqs.txt with the peptides of interest, or change line 84 of predict.py to the path of your own peptide file. Make sure that in this file each line corresponds to a single peptide sequence (<= 50 amino acids in length); a usage sketch follows this README.
5
+
6
+
7
+ ## Software version
8
+ pytorch: 1.11.0+cu113 (this code runs only on a CUDA-capable device)
9
+
10
+ ## Configuration
11
+ conda create -n apex python==3.9
12
+
13
+ conda activate apex
14
+
15
+ pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
16
+
17
+ pip install -r requirement.txt
18
+
19
+ ## Running
20
+ python predict.py test_seqs.txt
21
+
22
+ ## Contacts
23
+ If you have any questions or comments, please feel free to email Fangping Wan (fangping[dot]wan[at]pennmedicine[dot]upenn[dot]edu) and/or César de la Fuente (cfuente[at]pennmedicine[dot]upenn[dot]edu).
24
+
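To illustrate the input format the README describes (one peptide per line, at most 50 amino acids), here is a hedged sketch that writes a candidate list to test_seqs.txt after a basic length and alphabet check. The sequences below are placeholders, not peptides from this repository; the output path assumes the apex/ directory layout shown here.

```python
# Sketch: prepare an input file for apex/predict.py as described in the README above.
# The candidate sequences are illustrative placeholders; replace them with your own.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

candidates = [
    "GIGKFLHSAKKFGKAFVGEIMNS",                # placeholder example sequence
    "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK",  # placeholder example sequence
]

with open("apex/test_seqs.txt", "w") as handle:
    for seq in candidates:
        seq = seq.strip().upper()
        if len(seq) <= 50 and set(seq) <= VALID_RESIDUES:
            handle.write(seq + "\n")
        else:
            print(f"Skipping invalid or overlong sequence: {seq}")

# Then, following the README's Running section, from the apex/ directory:
#   python predict.py test_seqs.txt
# Predictions are written to Predicted_MICs.csv.
```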
apex/aaindex1.csv ADDED
@@ -0,0 +1,567 @@
1
+ Description,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
2
+ ANDN920101,4.35,4.38,4.75,4.76,4.65,4.37,4.29,3.97,4.63,3.95,4.17,4.36,4.52,4.66,4.44,4.50,4.35,4.70,4.60,3.95
3
+ ARGP820101,0.61,0.60,0.06,0.46,1.07,0.,0.47,0.07,0.61,2.22,1.53,1.15,1.18,2.02,1.95,0.05,0.05,2.65,1.88,1.32
4
+ ARGP820102,1.18,0.20,0.23,0.05,1.89,0.72,0.11,0.49,0.31,1.45,3.23,0.06,2.67,1.96,0.76,0.97,0.84,0.77,0.39,1.08
5
+ ARGP820103,1.56,0.45,0.27,0.14,1.23,0.51,0.23,0.62,0.29,1.67,2.93,0.15,2.96,2.03,0.76,0.81,0.91,1.08,0.68,1.14
6
+ BEGF750101,1.,0.52,0.35,0.44,0.06,0.44,0.73,0.35,0.60,0.73,1.,0.60,1.,0.60,0.06,0.35,0.44,0.73,0.44,0.82
7
+ BEGF750102,0.77,0.72,0.55,0.65,0.65,0.72,0.55,0.65,0.83,0.98,0.83,0.55,0.98,0.98,0.55,0.55,0.83,0.77,0.83,0.98
8
+ BEGF750103,0.37,0.84,0.97,0.97,0.84,0.64,0.53,0.97,0.75,0.37,0.53,0.75,0.64,0.53,0.97,0.84,0.75,0.97,0.84,0.37
9
+ BHAR880101,0.357,0.529,0.463,0.511,0.346,0.493,0.497,0.544,0.323,0.462,0.365,0.466,0.295,0.314,0.509,0.507,0.444,0.305,0.420,0.386
10
+ BIGC670101,52.6,109.1,75.7,68.4,68.3,89.7,84.7,36.3,91.9,102.0,102.0,105.1,97.7,113.9,73.6,54.9,71.2,135.4,116.2,85.1
11
+ BIOV880101,16.,-70.,-74.,-78.,168.,-73.,-106.,-13.,50.,151.,145.,-141.,124.,189.,-20.,-70.,-38.,145.,53.,123.
12
+ BIOV880102,44.,-68.,-72.,-91.,90.,-117.,-139.,-8.,47.,100.,108.,-188.,121.,148.,-36.,-60.,-54.,163.,22.,117.
13
+ BROC820101,7.3,-3.6,-5.7,-2.9,-9.2,-0.3,-7.1,-1.2,-2.1,6.6,20.0,-3.7,5.6,19.2,5.1,-4.1,0.8,16.3,5.9,3.5
14
+ BROC820102,3.9,3.2,-2.8,-2.8,-14.3,1.8,-7.5,-2.3,2.0,11.0,15.0,-2.5,4.1,14.7,5.6,-3.5,1.1,17.8,3.8,2.1
15
+ BULH740101,-0.20,-0.12,0.08,-0.20,-0.45,0.16,-0.30,0.00,-0.12,-2.26,-2.46,-0.35,-1.47,-2.33,-0.98,-0.39,-0.52,-2.01,-2.24,-1.56
16
+ BULH740102,0.691,0.728,0.596,0.558,0.624,0.649,0.632,0.592,0.646,0.809,0.842,0.767,0.709,0.756,0.730,0.594,0.655,0.743,0.743,0.777
17
+ BUNA790101,8.249,8.274,8.747,8.410,8.312,8.411,8.368,8.391,8.415,8.195,8.423,8.408,8.418,8.228,0.,8.380,8.236,8.094,8.183,8.436
18
+ BUNA790102,4.349,4.396,4.755,4.765,4.686,4.373,4.295,3.972,4.630,4.224,4.385,4.358,4.513,4.663,4.471,4.498,4.346,4.702,4.604,4.184
19
+ BUNA790103,6.5,6.9,7.5,7.0,7.7,6.0,7.0,5.6,8.0,7.0,6.5,6.5,0.,9.4,0.,6.5,6.9,0.,6.8,7.0
20
+ BURA740101,0.486,0.262,0.193,0.288,0.200,0.418,0.538,0.120,0.400,0.370,0.420,0.402,0.417,0.318,0.208,0.200,0.272,0.462,0.161,0.379
21
+ BURA740102,0.288,0.362,0.229,0.271,0.533,0.327,0.262,0.312,0.200,0.411,0.400,0.265,0.375,0.318,0.340,0.354,0.388,0.231,0.429,0.495
22
+ CHAM810101,0.52,0.68,0.76,0.76,0.62,0.68,0.68,0.00,0.70,1.02,0.98,0.68,0.78,0.70,0.36,0.53,0.50,0.70,0.70,0.76
23
+ CHAM820101,0.046,0.291,0.134,0.105,0.128,0.180,0.151,0.000,0.230,0.186,0.186,0.219,0.221,0.290,0.131,0.062,0.108,0.409,0.298,0.140
24
+ CHAM820102,-0.368,-1.03,0.,2.06,4.53,0.731,1.77,-0.525,0.,0.791,1.07,0.,0.656,1.06,-2.24,-0.524,0.,1.60,4.91,0.401
25
+ CHAM830101,0.71,1.06,1.37,1.21,1.19,0.87,0.84,1.52,1.07,0.66,0.69,0.99,0.59,0.71,1.61,1.34,1.08,0.76,1.07,0.63
26
+ CHAM830102,-0.118,0.124,0.289,0.048,0.083,-0.105,-0.245,0.104,0.138,0.230,-0.052,0.032,-0.258,0.015,0.,0.225,0.166,0.158,0.094,0.513
27
+ CHAM830103,0.,1.,1.,1.,1.,1.,1.,0.,1.,2.,1.,1.,1.,1.,0.,1.,2.,1.,1.,2.
28
+ CHAM830104,0.,1.,1.,1.,0.,1.,1.,0.,1.,1.,2.,1.,1.,1.,0.,0.,0.,1.,1.,0.
29
+ CHAM830105,0.,1.,0.,0.,0.,1.,1.,0.,1.,0.,0.,1.,1.,1.,0.,0.,0.,1.5,1.,0.
30
+ CHAM830106,0.,5.,2.,2.,1.,3.,3.,0.,3.,2.,2.,4.,3.,4.,0.,1.,1.,5.,5.,1.
31
+ CHAM830107,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.
32
+ CHAM830108,0.,1.,1.,0.,1.,1.,0.,0.,1.,0.,0.,1.,1.,1.,0.,0.,0.,1.,1.,0.
33
+ CHOC750101,91.5,202.0,135.2,124.5,117.7,161.1,155.1,66.4,167.3,168.8,167.9,171.3,170.8,203.4,129.3,99.1,122.1,237.6,203.6,141.7
34
+ CHOC760101,115.,225.,160.,150.,135.,180.,190.,75.,195.,175.,170.,200.,185.,210.,145.,115.,140.,255.,230.,155.
35
+ CHOC760102,25.,90.,63.,50.,19.,71.,49.,23.,43.,18.,23.,97.,31.,24.,50.,44.,47.,32.,60.,18.
36
+ CHOC760103,0.38,0.01,0.12,0.15,0.45,0.07,0.18,0.36,0.17,0.60,0.45,0.03,0.40,0.50,0.18,0.22,0.23,0.27,0.15,0.54
37
+ CHOC760104,0.20,0.00,0.03,0.04,0.22,0.01,0.03,0.18,0.02,0.19,0.16,0.00,0.11,0.14,0.04,0.08,0.08,0.04,0.03,0.18
38
+ CHOP780101,0.66,0.95,1.56,1.46,1.19,0.98,0.74,1.56,0.95,0.47,0.59,1.01,0.60,0.60,1.52,1.43,0.96,0.96,1.14,0.50
39
+ CHOP780201,1.42,0.98,0.67,1.01,0.70,1.11,1.51,0.57,1.00,1.08,1.21,1.16,1.45,1.13,0.57,0.77,0.83,1.08,0.69,1.06
40
+ CHOP780202,0.83,0.93,0.89,0.54,1.19,1.10,0.37,0.75,0.87,1.60,1.30,0.74,1.05,1.38,0.55,0.75,1.19,1.37,1.47,1.70
41
+ CHOP780203,0.74,1.01,1.46,1.52,0.96,0.96,0.95,1.56,0.95,0.47,0.50,1.19,0.60,0.66,1.56,1.43,0.98,0.60,1.14,0.59
42
+ CHOP780204,1.29,0.44,0.81,2.02,0.66,1.22,2.44,0.76,0.73,0.67,0.58,0.66,0.71,0.61,2.01,0.74,1.08,1.47,0.68,0.61
43
+ CHOP780205,1.20,1.25,0.59,0.61,1.11,1.22,1.24,0.42,1.77,0.98,1.13,1.83,1.57,1.10,0.00,0.96,0.75,0.40,0.73,1.25
44
+ CHOP780206,0.70,0.34,1.42,0.98,0.65,0.75,1.04,1.41,1.22,0.78,0.85,1.01,0.83,0.93,1.10,1.55,1.09,0.62,0.99,0.75
45
+ CHOP780207,0.52,1.24,1.64,1.06,0.94,0.70,0.59,1.64,1.86,0.87,0.84,1.49,0.52,1.04,1.58,0.93,0.86,0.16,0.96,0.32
46
+ CHOP780208,0.86,0.90,0.66,0.38,0.87,1.65,0.35,0.63,0.54,1.94,1.30,1.00,1.43,1.50,0.66,0.63,1.17,1.49,1.07,1.69
47
+ CHOP780209,0.75,0.90,1.21,0.85,1.11,0.65,0.55,0.74,0.90,1.35,1.27,0.74,0.95,1.50,0.40,0.79,0.75,1.19,1.96,1.79
48
+ CHOP780210,0.67,0.89,1.86,1.39,1.34,1.09,0.92,1.46,0.78,0.59,0.46,1.09,0.52,0.30,1.58,1.41,1.09,0.48,1.23,0.42
49
+ CHOP780211,0.74,1.05,1.13,1.32,0.53,0.77,0.85,1.68,0.96,0.53,0.59,0.82,0.85,0.44,1.69,1.49,1.16,1.59,1.01,0.59
50
+ CHOP780212,0.060,0.070,0.161,0.147,0.149,0.074,0.056,0.102,0.140,0.043,0.061,0.055,0.068,0.059,0.102,0.120,0.086,0.077,0.082,0.062
51
+ CHOP780213,0.076,0.106,0.083,0.110,0.053,0.098,0.060,0.085,0.047,0.034,0.025,0.115,0.082,0.041,0.301,0.139,0.108,0.013,0.065,0.048
52
+ CHOP780214,0.035,0.099,0.191,0.179,0.117,0.037,0.077,0.190,0.093,0.013,0.036,0.072,0.014,0.065,0.034,0.125,0.065,0.064,0.114,0.028
53
+ CHOP780215,0.058,0.085,0.091,0.081,0.128,0.098,0.064,0.152,0.054,0.056,0.070,0.095,0.055,0.065,0.068,0.106,0.079,0.167,0.125,0.053
54
+ CHOP780216,0.64,1.05,1.56,1.61,0.92,0.84,0.80,1.63,0.77,0.29,0.36,1.13,0.51,0.62,2.04,1.52,0.98,0.48,1.08,0.43
55
+ CIDH920101,-0.45,-0.24,-0.20,-1.52,0.79,-0.99,-0.80,-1.00,1.07,0.76,1.29,-0.36,1.37,1.48,-0.12,-0.98,-0.70,1.38,1.49,1.26
56
+ CIDH920102,-0.08,-0.09,-0.70,-0.71,0.76,-0.40,-1.31,-0.84,0.43,1.39,1.24,-0.09,1.27,1.53,-0.01,-0.93,-0.59,2.25,1.53,1.09
57
+ CIDH920103,0.36,-0.52,-0.90,-1.09,0.70,-1.05,-0.83,-0.82,0.16,2.17,1.18,-0.56,1.21,1.01,-0.06,-0.60,-1.20,1.31,1.05,1.21
58
+ CIDH920104,0.17,-0.70,-0.90,-1.05,1.24,-1.20,-1.19,-0.57,-0.25,2.06,0.96,-0.62,0.60,1.29,-0.21,-0.83,-0.62,1.51,0.66,1.21
59
+ CIDH920105,0.02,-0.42,-0.77,-1.04,0.77,-1.10,-1.14,-0.80,0.26,1.81,1.14,-0.41,1.00,1.35,-0.09,-0.97,-0.77,1.71,1.11,1.13
60
+ COHE430101,0.75,0.70,0.61,0.60,0.61,0.67,0.66,0.64,0.67,0.90,0.90,0.82,0.75,0.77,0.76,0.68,0.70,0.74,0.71,0.86
61
+ CRAJ730101,1.33,0.79,0.72,0.97,0.93,1.42,1.66,0.58,1.49,0.99,1.29,1.03,1.40,1.15,0.49,0.83,0.94,1.33,0.49,0.96
62
+ CRAJ730102,1.00,0.74,0.75,0.89,0.99,0.87,0.37,0.56,0.36,1.75,1.53,1.18,1.40,1.26,0.36,0.65,1.15,0.84,1.41,1.61
63
+ CRAJ730103,0.60,0.79,1.42,1.24,1.29,0.92,0.64,1.38,0.95,0.67,0.70,1.10,0.67,1.05,1.47,1.26,1.05,1.23,1.35,0.48
64
+ DAWD720101,2.5,7.5,5.0,2.5,3.0,6.0,5.0,0.5,6.0,5.5,5.5,7.0,6.0,6.5,5.5,3.0,5.0,7.0,7.0,5.0
65
+ DAYM780101,8.6,4.9,4.3,5.5,2.9,3.9,6.0,8.4,2.0,4.5,7.4,6.6,1.7,3.6,5.2,7.0,6.1,1.3,3.4,6.6
66
+ DAYM780201,100.,65.,134.,106.,20.,93.,102.,49.,66.,96.,40.,56.,94.,41.,56.,120.,97.,18.,41.,74.
67
+ DESM900101,1.56,0.59,0.51,0.23,1.80,0.39,0.19,1.03,1.,1.27,1.38,0.15,1.93,1.42,0.27,0.96,1.11,0.91,1.10,1.58
68
+ DESM900102,1.26,0.38,0.59,0.27,1.60,0.39,0.23,1.08,1.,1.44,1.36,0.33,1.52,1.46,0.54,0.98,1.01,1.06,0.89,1.33
69
+ EISD840101,0.25,-1.76,-0.64,-0.72,0.04,-0.69,-0.62,0.16,-0.40,0.73,0.53,-1.10,0.26,0.61,-0.07,-0.26,-0.18,0.37,0.02,0.54
70
+ EISD860101,0.67,-2.1,-0.6,-1.2,0.38,-0.22,-0.76,0.,0.64,1.9,1.9,-0.57,2.4,2.3,1.2,0.01,0.52,2.6,1.6,1.5
71
+ EISD860102,0.,10.,1.3,1.9,0.17,1.9,3.,0.,0.99,1.2,1.0,5.7,1.9,1.1,0.18,0.73,1.5,1.6,1.8,0.48
72
+ EISD860103,0.,-0.96,-0.86,-0.98,0.76,-1.0,-0.89,0.,-0.75,0.99,0.89,-0.99,0.94,0.92,0.22,-0.67,0.09,0.67,-0.93,0.84
73
+ FASG760101,89.09,174.20,132.12,133.10,121.15,146.15,147.13,75.07,155.16,131.17,131.17,146.19,149.21,165.19,115.13,105.09,119.12,204.24,181.19,117.15
74
+ FASG760102,297.,238.,236.,270.,178.,185.,249.,290.,277.,284.,337.,224.,283.,284.,222.,228.,253.,282.,344.,293.
75
+ FASG760103,1.80,12.50,-5.60,5.05,-16.50,6.30,12.00,0.00,-38.50,12.40,-11.00,14.60,-10.00,-34.50,-86.20,-7.50,-28.00,-33.70,-10.00,5.63
76
+ FASG760104,9.69,8.99,8.80,9.60,8.35,9.13,9.67,9.78,9.17,9.68,9.60,9.18,9.21,9.18,10.64,9.21,9.10,9.44,9.11,9.62
77
+ FASG760105,2.34,1.82,2.02,1.88,1.92,2.17,2.10,2.35,1.82,2.36,2.36,2.16,2.28,2.16,1.95,2.19,2.09,2.43,2.20,2.32
78
+ FAUJ830101,0.31,-1.01,-0.60,-0.77,1.54,-0.22,-0.64,0.00,0.13,1.80,1.70,-0.99,1.23,1.79,0.72,-0.04,0.26,2.25,0.96,1.22
79
+ FAUJ880101,1.28,2.34,1.60,1.60,1.77,1.56,1.56,0.00,2.99,4.19,2.59,1.89,2.35,2.94,2.67,1.31,3.03,3.21,2.94,3.67
80
+ FAUJ880102,0.53,0.69,0.58,0.59,0.66,0.71,0.72,0.00,0.64,0.96,0.92,0.78,0.77,0.71,0.,0.55,0.63,0.84,0.71,0.89
81
+ FAUJ880103,1.00,6.13,2.95,2.78,2.43,3.95,3.78,0.00,4.66,4.00,4.00,4.77,4.43,5.89,2.72,1.60,2.60,8.08,6.47,3.00
82
+ FAUJ880104,2.87,7.82,4.58,4.74,4.47,6.11,5.97,2.06,5.23,4.92,4.92,6.89,6.36,4.62,4.11,3.97,4.11,7.68,4.73,4.11
83
+ FAUJ880105,1.52,1.52,1.52,1.52,1.52,1.52,1.52,1.00,1.52,1.90,1.52,1.52,1.52,1.52,1.52,1.52,1.73,1.52,1.52,1.90
84
+ FAUJ880106,2.04,6.24,4.37,3.78,3.41,3.53,3.31,1.00,5.66,3.49,4.45,4.87,4.80,6.02,4.31,2.70,3.17,5.90,6.72,3.17
85
+ FAUJ880107,7.3,11.1,8.0,9.2,14.4,10.6,11.4,0.0,10.2,16.1,10.1,10.9,10.4,13.9,17.8,13.1,16.7,13.2,13.9,17.2
86
+ FAUJ880108,-0.01,0.04,0.06,0.15,0.12,0.05,0.07,0.00,0.08,-0.01,-0.01,0.00,0.04,0.03,0.,0.11,0.04,0.00,0.03,0.01
87
+ FAUJ880109,0.,4.,2.,1.,0.,2.,1.,0.,1.,0.,0.,2.,0.,0.,0.,1.,1.,1.,1.,0.
88
+ FAUJ880110,0.,3.,3.,4.,0.,3.,4.,0.,1.,0.,0.,1.,0.,0.,0.,2.,2.,0.,2.,0.
89
+ FAUJ880111,0.,1.,0.,0.,0.,0.,0.,0.,1.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,0.
90
+ FAUJ880112,0.,0.,0.,1.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.
91
+ FAUJ880113,4.76,4.30,3.64,5.69,3.67,4.54,5.48,3.77,2.84,4.81,4.79,4.27,4.25,4.31,0.,3.83,3.87,4.75,4.30,4.86
92
+ FINA770101,1.08,1.05,0.85,0.85,0.95,0.95,1.15,0.55,1.00,1.05,1.25,1.15,1.15,1.10,0.71,0.75,0.75,1.10,1.10,0.95
93
+ FINA910101,1.,0.70,1.70,3.20,1.,1.,1.70,1.,1.,0.60,1.,0.70,1.,1.,1.,1.70,1.70,1.,1.,0.60
94
+ FINA910102,1.,0.70,1.,1.70,1.,1.,1.70,1.30,1.,1.,1.,0.70,1.,1.,13.,1.,1.,1.,1.,1.
95
+ FINA910103,1.20,1.70,1.20,0.70,1.,1.,0.70,0.80,1.20,0.80,1.,1.70,1.,1.,1.,1.50,1.,1.,1.,0.80
96
+ FINA910104,1.,1.70,1.,0.70,1.,1.,0.70,1.50,1.,1.,1.,1.70,1.,1.,0.10,1.,1.,1.,1.,1.
97
+ GARJ730101,0.28,0.10,0.25,0.21,0.28,0.35,0.33,0.17,0.21,0.82,1.00,0.09,0.74,2.18,0.39,0.12,0.21,5.70,1.26,0.60
98
+ GEIM800101,1.29,1.,0.81,1.10,0.79,1.07,1.49,0.63,1.33,1.05,1.31,1.33,1.54,1.13,0.63,0.78,0.77,1.18,0.71,0.81
99
+ GEIM800102,1.13,1.09,1.06,0.94,1.32,0.93,1.20,0.83,1.09,1.05,1.13,1.08,1.23,1.01,0.82,1.01,1.17,1.32,0.88,1.13
100
+ GEIM800103,1.55,0.20,1.20,1.55,1.44,1.13,1.67,0.59,1.21,1.27,1.25,1.20,1.37,0.40,0.21,1.01,0.55,1.86,1.08,0.64
101
+ GEIM800104,1.19,1.,0.94,1.07,0.95,1.32,1.64,0.60,1.03,1.12,1.18,1.27,1.49,1.02,0.68,0.81,0.85,1.18,0.77,0.74
102
+ GEIM800105,0.84,1.04,0.66,0.59,1.27,1.02,0.57,0.94,0.81,1.29,1.10,0.86,0.88,1.15,0.80,1.05,1.20,1.15,1.39,1.56
103
+ GEIM800106,0.86,1.15,0.60,0.66,0.91,1.11,0.37,0.86,1.07,1.17,1.28,1.01,1.15,1.34,0.61,0.91,1.14,1.13,1.37,1.31
104
+ GEIM800107,0.91,0.99,0.72,0.74,1.12,0.90,0.41,0.91,1.01,1.29,1.23,0.86,0.96,1.26,0.65,0.93,1.05,1.15,1.21,1.58
105
+ GEIM800108,0.91,1.,1.64,1.40,0.93,0.94,0.97,1.51,0.90,0.65,0.59,0.82,0.58,0.72,1.66,1.23,1.04,0.67,0.92,0.60
106
+ GEIM800109,0.80,0.96,1.10,1.60,0.,1.60,0.40,2.,0.96,0.85,0.80,0.94,0.39,1.20,2.10,1.30,0.60,0.,1.80,0.80
107
+ GEIM800110,1.10,0.93,1.57,1.41,1.05,0.81,1.40,1.30,0.85,0.67,0.52,0.94,0.69,0.60,1.77,1.13,0.88,0.62,0.41,0.58
108
+ GEIM800111,0.93,1.01,1.36,1.22,0.92,0.83,1.05,1.45,0.96,0.58,0.59,0.91,0.60,0.71,1.67,1.25,1.08,0.68,0.98,0.62
109
+ GOLD730101,0.75,0.75,0.69,0.00,1.00,0.59,0.00,0.00,0.00,2.95,2.40,1.50,1.30,2.65,2.60,0.00,0.45,3.00,2.85,1.70
110
+ GOLD730102,88.3,181.2,125.1,110.8,112.4,148.7,140.5,60.0,152.6,168.5,168.5,175.6,162.2,189.0,122.2,88.7,118.2,227.0,193.0,141.4
111
+ GRAR740101,0.00,0.65,1.33,1.38,2.75,0.89,0.92,0.74,0.58,0.00,0.00,0.33,0.00,0.00,0.39,1.42,0.71,0.13,0.20,0.00
112
+ GRAR740102,8.1,10.5,11.6,13.0,5.5,10.5,12.3,9.0,10.4,5.2,4.9,11.3,5.7,5.2,8.0,9.2,8.6,5.4,6.2,5.9
113
+ GRAR740103,31.,124.,56.,54.,55.,85.,83.,3.,96.,111.,111.,119.,105.,132.,32.5,32.,61.,170.,136.,84.
114
+ GUYH850101,0.10,1.91,0.48,0.78,-1.42,0.95,0.83,0.33,-0.50,-1.13,-1.18,1.40,-1.59,-2.12,0.73,0.52,0.07,-0.51,-0.21,-1.27
115
+ HOPA770101,1.0,2.3,2.2,6.5,0.1,2.1,6.2,1.1,2.8,0.8,0.8,5.3,0.7,1.4,0.9,1.7,1.5,1.9,2.1,0.9
116
+ HOPT810101,-0.5,3.0,0.2,3.0,-1.0,0.2,3.0,0.0,-0.5,-1.8,-1.8,3.0,-1.3,-2.5,0.0,0.3,-0.4,-3.4,-2.3,-1.5
117
+ HUTJ700101,29.22,26.37,38.30,37.09,50.70,44.02,41.84,23.71,59.64,45.00,48.03,57.10,69.32,48.52,36.13,32.40,35.20,56.92,51.73,40.35
118
+ HUTJ700102,30.88,68.43,41.70,40.66,53.83,46.62,44.98,24.74,65.99,49.71,50.62,63.21,55.32,51.06,39.21,35.65,36.50,60.00,51.15,42.75
119
+ HUTJ700103,154.33,341.01,207.90,194.91,219.79,235.51,223.16,127.90,242.54,233.21,232.30,300.46,202.65,204.74,179.93,174.06,205.80,237.01,229.15,207.60
120
+ ISOY800101,1.53,1.17,0.60,1.00,0.89,1.27,1.63,0.44,1.03,1.07,1.32,1.26,1.66,1.22,0.25,0.65,0.86,1.05,0.70,0.93
121
+ ISOY800102,0.86,0.98,0.74,0.69,1.39,0.89,0.66,0.70,1.06,1.31,1.01,0.77,1.06,1.16,1.16,1.09,1.24,1.17,1.28,1.40
122
+ ISOY800103,0.78,1.06,1.56,1.50,0.60,0.78,0.97,1.73,0.83,0.40,0.57,1.01,0.30,0.67,1.55,1.19,1.09,0.74,1.14,0.44
123
+ ISOY800104,1.09,0.97,1.14,0.77,0.50,0.83,0.92,1.25,0.67,0.66,0.44,1.25,0.45,0.50,2.96,1.21,1.33,0.62,0.94,0.56
124
+ ISOY800105,0.35,0.75,2.12,2.16,0.50,0.73,0.65,2.40,1.19,0.12,0.58,0.83,0.22,0.89,0.43,1.24,0.85,0.62,1.44,0.43
125
+ ISOY800106,1.09,1.07,0.88,1.24,1.04,1.09,1.14,0.27,1.07,0.97,1.30,1.20,0.55,0.80,1.78,1.20,0.99,1.03,0.69,0.77
126
+ ISOY800107,1.34,2.78,0.92,1.77,1.44,0.79,2.54,0.95,0.00,0.52,1.05,0.79,0.00,0.43,0.37,0.87,1.14,1.79,0.73,0.00
127
+ ISOY800108,0.47,0.52,2.16,1.15,0.41,0.95,0.64,3.03,0.89,0.62,0.53,0.98,0.68,0.61,0.63,1.03,0.39,0.63,0.83,0.76
128
+ JANJ780101,27.8,94.7,60.1,60.6,15.5,68.7,68.2,24.5,50.7,22.8,27.6,103.0,33.5,25.5,51.5,42.0,45.0,34.7,55.2,23.7
129
+ JANJ780102,51.,5.,22.,19.,74.,16.,16.,52.,34.,66.,60.,3.,52.,58.,25.,35.,30.,49.,24.,64.
130
+ JANJ780103,15.,67.,49.,50.,5.,56.,55.,10.,34.,13.,16.,85.,20.,10.,45.,32.,32.,17.,41.,14.
131
+ JANJ790101,1.7,0.1,0.4,0.4,4.6,0.3,0.3,1.8,0.8,3.1,2.4,0.05,1.9,2.2,0.6,0.8,0.7,1.6,0.5,2.9
132
+ JANJ790102,0.3,-1.4,-0.5,-0.6,0.9,-0.7,-0.7,0.3,-0.1,0.7,0.5,-1.8,0.4,0.5,-0.3,-0.1,-0.2,0.3,-0.4,0.6
133
+ JOND750101,0.87,0.85,0.09,0.66,1.52,0.00,0.67,0.10,0.87,3.15,2.17,1.64,1.67,2.87,2.77,0.07,0.07,3.77,2.67,1.87
134
+ JOND750102,2.34,1.18,2.02,2.01,1.65,2.17,2.19,2.34,1.82,2.36,2.36,2.18,2.28,1.83,1.99,2.21,2.10,2.38,2.20,2.32
135
+ JOND920101,0.077,0.051,0.043,0.052,0.020,0.041,0.062,0.074,0.023,0.053,0.091,0.059,0.024,0.040,0.051,0.069,0.059,0.014,0.032,0.066
136
+ JOND920102,100.,83.,104.,86.,44.,84.,77.,50.,91.,103.,54.,72.,93.,51.,58.,117.,107.,25.,50.,98.
137
+ JUKT750101,5.3,2.6,3.0,3.6,1.3,2.4,3.3,4.8,1.4,3.1,4.7,4.1,1.1,2.3,2.5,4.5,3.7,0.8,2.3,4.2
138
+ JUNJ780101,685.,382.,397.,400.,241.,313.,427.,707.,155.,394.,581.,575.,132.,303.,366.,593.,490.,99.,292.,553.
139
+ KANM800101,1.36,1.00,0.89,1.04,0.82,1.14,1.48,0.63,1.11,1.08,1.21,1.22,1.45,1.05,0.52,0.74,0.81,0.97,0.79,0.94
140
+ KANM800102,0.81,0.85,0.62,0.71,1.17,0.98,0.53,0.88,0.92,1.48,1.24,0.77,1.05,1.20,0.61,0.92,1.18,1.18,1.23,1.66
141
+ KANM800103,1.45,1.15,0.64,0.91,0.70,1.14,1.29,0.53,1.13,1.23,1.56,1.27,1.83,1.20,0.21,0.48,0.77,1.17,0.74,1.10
142
+ KANM800104,0.75,0.79,0.33,0.31,1.46,0.75,0.46,0.83,0.83,1.87,1.56,0.66,0.86,1.37,0.52,0.82,1.36,0.79,1.08,2.00
143
+ KARP850101,1.041,1.038,1.117,1.033,0.960,1.165,1.094,1.142,0.982,1.002,0.967,1.093,0.947,0.930,1.055,1.169,1.073,0.925,0.961,0.982
144
+ KARP850102,0.946,1.028,1.006,1.089,0.878,1.025,1.036,1.042,0.952,0.892,0.961,1.082,0.862,0.912,1.085,1.048,1.051,0.917,0.930,0.927
145
+ KARP850103,0.892,0.901,0.930,0.932,0.925,0.885,0.933,0.923,0.894,0.872,0.921,1.057,0.804,0.914,0.932,0.923,0.934,0.803,0.837,0.913
146
+ KHAG800101,49.1,133.,-3.6,0.,0.,20.,0.,64.6,75.7,18.9,15.6,0.,6.8,54.7,43.8,44.4,31.0,70.5,0.,29.5
147
+ KLEP840101,0.,1.,0.,-1.,0.,0.,-1.,0.,0.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,0.
148
+ KRIW710101,4.60,6.50,5.90,5.70,-1.00,6.10,5.60,7.60,4.50,2.60,3.25,7.90,1.40,3.20,7.00,5.25,4.80,4.00,4.35,3.40
149
+ KRIW790101,4.32,6.55,6.24,6.04,1.73,6.13,6.17,6.09,5.66,2.31,3.93,7.92,2.44,2.59,7.19,5.37,5.16,2.78,3.58,3.31
150
+ KRIW790102,0.28,0.34,0.31,0.33,0.11,0.39,0.37,0.28,0.23,0.12,0.16,0.59,0.08,0.10,0.46,0.27,0.26,0.15,0.25,0.22
151
+ KRIW790103,27.5,105.0,58.7,40.0,44.6,80.7,62.0,0.0,79.0,93.5,93.5,100.0,94.1,115.5,41.9,29.3,51.3,145.5,117.3,71.5
152
+ KYTJ820101,1.8,-4.5,-3.5,-3.5,2.5,-3.5,-3.5,-0.4,-3.2,4.5,3.8,-3.9,1.9,2.8,-1.6,-0.8,-0.7,-0.9,-1.3,4.2
153
+ LAWE840101,-0.48,-0.06,-0.87,-0.75,-0.32,-0.32,-0.71,0.00,-0.51,0.81,1.02,-0.09,0.81,1.03,2.03,0.05,-0.35,0.66,1.24,0.56
154
+ LEVM760101,-0.5,3.0,0.2,2.5,-1.0,0.2,2.5,0.0,-0.5,-1.8,-1.8,3.0,-1.3,-2.5,-1.4,0.3,-0.4,-3.4,-2.3,-1.5
155
+ LEVM760102,0.77,3.72,1.98,1.99,1.38,2.58,2.63,0.00,2.76,1.83,2.08,2.94,2.34,2.97,1.42,1.28,1.43,3.58,3.36,1.49
156
+ LEVM760103,121.9,121.4,117.5,121.2,113.7,118.0,118.2,0.,118.2,118.9,118.1,122.0,113.1,118.2,81.9,117.9,117.1,118.4,110.0,121.7
157
+ LEVM760104,243.2,206.6,207.1,215.0,209.4,205.4,213.6,300.0,219.9,217.9,205.6,210.9,204.0,203.7,237.4,232.0,226.7,203.7,195.6,220.3
158
+ LEVM760105,0.77,2.38,1.45,1.43,1.22,1.75,1.77,0.58,1.78,1.56,1.54,2.08,1.80,1.90,1.25,1.08,1.24,2.21,2.13,1.29
159
+ LEVM760106,5.2,6.0,5.0,5.0,6.1,6.0,6.0,4.2,6.0,7.0,7.0,6.0,6.8,7.1,6.2,4.9,5.0,7.6,7.1,6.4
160
+ LEVM760107,0.025,0.20,0.10,0.10,0.10,0.10,0.10,0.025,0.10,0.19,0.19,0.20,0.19,0.39,0.17,0.025,0.10,0.56,0.39,0.15
161
+ LEVM780101,1.29,0.96,0.90,1.04,1.11,1.27,1.44,0.56,1.22,0.97,1.30,1.23,1.47,1.07,0.52,0.82,0.82,0.99,0.72,0.91
162
+ LEVM780102,0.90,0.99,0.76,0.72,0.74,0.80,0.75,0.92,1.08,1.45,1.02,0.77,0.97,1.32,0.64,0.95,1.21,1.14,1.25,1.49
163
+ LEVM780103,0.77,0.88,1.28,1.41,0.81,0.98,0.99,1.64,0.68,0.51,0.58,0.96,0.41,0.59,1.91,1.32,1.04,0.76,1.05,0.47
164
+ LEVM780104,1.32,0.98,0.95,1.03,0.92,1.10,1.44,0.61,1.31,0.93,1.31,1.25,1.39,1.02,0.58,0.76,0.79,0.97,0.73,0.93
165
+ LEVM780105,0.86,0.97,0.73,0.69,1.04,1.00,0.66,0.89,0.85,1.47,1.04,0.77,0.93,1.21,0.68,1.02,1.27,1.26,1.31,1.43
166
+ LEVM780106,0.79,0.90,1.25,1.47,0.79,0.92,1.02,1.67,0.81,0.50,0.57,0.99,0.51,0.77,1.78,1.30,0.97,0.79,0.93,0.46
167
+ LEWP710101,0.22,0.28,0.42,0.73,0.20,0.26,0.08,0.58,0.14,0.22,0.19,0.27,0.38,0.08,0.46,0.55,0.49,0.43,0.46,0.08
168
+ LIFS790101,0.92,0.93,0.60,0.48,1.16,0.95,0.61,0.61,0.93,1.81,1.30,0.70,1.19,1.25,0.40,0.82,1.12,1.54,1.53,1.81
169
+ LIFS790102,1.00,0.68,0.54,0.50,0.91,0.28,0.59,0.79,0.38,2.60,1.42,0.59,1.49,1.30,0.35,0.70,0.59,0.89,1.08,2.63
170
+ LIFS790103,0.90,1.02,0.62,0.47,1.24,1.18,0.62,0.56,1.12,1.54,1.26,0.74,1.09,1.23,0.42,0.87,1.30,1.75,1.68,1.53
171
+ MANP780101,12.97,11.72,11.42,10.85,14.63,11.76,11.89,12.43,12.16,15.67,14.90,11.36,14.39,14.00,11.37,11.23,11.69,13.93,13.42,15.71
172
+ MAXF760101,1.43,1.18,0.64,0.92,0.94,1.22,1.67,0.46,0.98,1.04,1.36,1.27,1.53,1.19,0.49,0.70,0.78,1.01,0.69,0.98
173
+ MAXF760102,0.86,0.94,0.74,0.72,1.17,0.89,0.62,0.97,1.06,1.24,0.98,0.79,1.08,1.16,1.22,1.04,1.18,1.07,1.25,1.33
174
+ MAXF760103,0.64,0.62,3.14,1.92,0.32,0.80,1.01,0.63,2.05,0.92,0.37,0.89,1.07,0.86,0.50,1.01,0.92,1.00,1.31,0.87
175
+ MAXF760104,0.17,0.76,2.62,1.08,0.95,0.91,0.28,5.02,0.57,0.26,0.21,1.17,0.00,0.28,0.12,0.57,0.23,0.00,0.97,0.24
176
+ MAXF760105,1.13,0.48,1.11,1.18,0.38,0.41,1.02,3.84,0.30,0.40,0.65,1.13,0.00,0.45,0.00,0.81,0.71,0.93,0.38,0.48
177
+ MAXF760106,1.00,1.18,0.87,1.39,1.09,1.13,1.04,0.46,0.71,0.68,1.01,1.05,0.36,0.65,1.95,1.56,1.23,1.10,0.87,0.58
178
+ MCMT640101,4.34,26.66,13.28,12.00,35.77,17.56,17.26,0.00,21.81,19.06,18.78,21.29,21.64,29.40,10.93,6.35,11.01,42.53,31.53,13.92
179
+ MEEJ800101,0.5,0.8,0.8,-8.2,-6.8,-4.8,-16.9,0.0,-3.5,13.9,8.8,0.1,4.8,13.2,6.1,1.2,2.7,14.9,6.1,2.7
180
+ MEEJ800102,-0.1,-4.5,-1.6,-2.8,-2.2,-2.5,-7.5,-0.5,0.8,11.8,10.0,-3.2,7.1,13.9,8.0,-3.7,1.5,18.1,8.2,3.3
181
+ MEEJ810101,1.1,-0.4,-4.2,-1.6,7.1,-2.9,0.7,-0.2,-0.7,8.5,11.0,-1.9,5.4,13.4,4.4,-3.2,-1.7,17.1,7.4,5.9
182
+ MEEJ810102,1.0,-2.0,-3.0,-0.5,4.6,-2.0,1.1,0.2,-2.2,7.0,9.6,-3.0,4.0,12.6,3.1,-2.9,-0.6,15.1,6.7,4.6
183
+ MEIH800101,0.93,0.98,0.98,1.01,0.88,1.02,1.02,1.01,0.89,0.79,0.85,1.05,0.84,0.78,1.00,1.02,0.99,0.83,0.93,0.81
184
+ MEIH800102,0.94,1.09,1.04,1.08,0.84,1.11,1.12,1.01,0.92,0.76,0.82,1.23,0.83,0.73,1.04,1.04,1.02,0.87,1.03,0.81
185
+ MEIH800103,87.,81.,70.,71.,104.,66.,72.,90.,90.,105.,104.,65.,100.,108.,78.,83.,83.,94.,83.,94.
186
+ MIYS850101,2.36,1.92,1.70,1.67,3.36,1.75,1.74,2.06,2.41,4.17,3.93,1.23,4.22,4.37,1.89,1.81,2.04,3.82,2.91,3.49
187
+ NAGK730101,1.29,0.83,0.77,1.00,0.94,1.10,1.54,0.72,1.29,0.94,1.23,1.23,1.23,1.23,0.70,0.78,0.87,1.06,0.63,0.97
188
+ NAGK730102,0.96,0.67,0.72,0.90,1.13,1.18,0.33,0.90,0.87,1.54,1.26,0.81,1.29,1.37,0.75,0.77,1.23,1.13,1.07,1.41
189
+ NAGK730103,0.72,1.33,1.38,1.04,1.01,0.81,0.75,1.35,0.76,0.80,0.63,0.84,0.62,0.58,1.43,1.34,1.03,0.87,1.35,0.83
190
+ NAKH900101,7.99,5.86,4.33,5.14,1.81,3.98,6.10,6.91,2.17,5.48,9.16,6.01,2.50,3.83,4.95,6.84,5.77,1.34,3.15,6.65
191
+ NAKH900102,3.73,3.34,2.33,2.23,2.30,2.36,3.,3.36,1.55,2.52,3.40,3.36,1.37,1.94,3.18,2.83,2.63,1.15,1.76,2.53
192
+ NAKH900103,5.74,1.92,5.25,2.11,1.03,2.30,2.63,5.66,2.30,9.12,15.36,3.20,5.30,6.51,4.79,7.55,7.51,2.51,4.08,5.12
193
+ NAKH900104,-0.60,-1.18,0.39,-1.36,-0.34,-0.71,-1.16,-0.37,0.08,1.44,1.82,-0.84,2.04,1.38,-0.05,0.25,0.66,1.02,0.53,-0.60
194
+ NAKH900105,5.88,1.54,4.38,1.70,1.11,2.30,2.60,5.29,2.33,8.78,16.52,2.58,6.00,6.58,5.29,7.68,8.38,2.89,3.51,4.66
195
+ NAKH900106,-0.57,-1.29,0.02,-1.54,-0.30,-0.71,-1.17,-0.48,0.10,1.31,2.16,-1.02,2.55,1.42,0.11,0.30,0.99,1.35,0.20,-0.79
196
+ NAKH900107,5.39,2.81,7.31,3.07,0.86,2.31,2.70,6.52,2.23,9.94,12.64,4.67,3.68,6.34,3.62,7.24,5.44,1.64,5.42,6.18
197
+ NAKH900108,-0.70,-0.91,1.28,-0.93,-0.41,-0.71,-1.13,-0.12,0.04,1.77,1.02,-0.40,0.86,1.29,-0.42,0.14,-0.13,0.26,1.29,-0.19
198
+ NAKH900109,9.25,3.96,3.71,3.89,1.07,3.17,4.80,8.51,1.88,6.47,10.94,3.50,3.14,6.36,4.36,6.26,5.66,2.22,3.28,7.55
199
+ NAKH900110,0.34,-0.57,-0.27,-0.56,-0.32,-0.34,-0.43,0.48,-0.19,0.39,0.52,-0.75,0.47,1.30,-0.19,-0.20,-0.04,0.77,0.07,0.36
200
+ NAKH900111,10.17,1.21,1.36,1.18,1.48,1.57,1.15,8.87,1.07,10.91,16.22,1.04,4.12,9.60,2.24,5.38,5.61,2.67,2.68,11.44
201
+ NAKH900112,6.61,0.41,1.84,0.59,0.83,1.20,1.63,4.88,1.14,12.91,21.66,1.15,7.17,7.76,3.51,6.84,8.89,2.11,2.57,6.30
202
+ NAKH900113,1.61,0.40,0.73,0.75,0.37,0.61,1.50,3.12,0.46,1.61,1.37,0.62,1.59,1.24,0.67,0.68,0.92,1.63,0.67,1.30
203
+ NAKH920101,8.63,6.75,4.18,6.24,1.03,4.76,7.82,6.80,2.70,3.48,8.44,6.25,2.14,2.73,6.28,8.53,4.43,0.80,2.54,5.44
204
+ NAKH920102,10.88,6.01,5.75,6.13,0.69,4.68,9.34,7.72,2.15,1.80,8.03,6.11,3.79,2.93,7.21,7.25,3.51,0.47,1.01,4.57
205
+ NAKH920103,5.15,4.38,4.81,5.75,3.24,4.45,7.05,6.38,2.69,4.40,8.11,5.25,1.60,3.52,5.65,8.04,7.41,1.68,3.42,7.00
206
+ NAKH920104,5.04,3.73,5.94,5.26,2.20,4.50,6.07,7.09,2.99,4.32,9.88,6.31,1.85,3.72,6.22,8.05,5.20,2.10,3.32,6.19
207
+ NAKH920105,9.90,0.09,0.94,0.35,2.55,0.87,0.08,8.14,0.20,15.25,22.28,0.16,1.85,6.47,2.38,4.17,4.33,2.21,3.42,14.34
208
+ NAKH920106,6.69,6.65,4.49,4.97,1.70,5.39,7.76,6.32,2.11,4.51,8.23,8.36,2.46,3.59,5.20,7.40,5.18,1.06,2.75,5.27
209
+ NAKH920107,5.08,4.75,5.75,5.96,2.95,4.24,6.04,8.20,2.10,4.95,8.03,4.93,2.61,4.36,4.84,6.41,5.87,2.31,4.55,6.07
210
+ NAKH920108,9.36,0.27,2.31,0.94,2.56,1.14,0.94,6.17,0.47,13.73,16.64,0.58,3.93,10.99,1.96,5.58,4.68,2.20,3.13,12.43
211
+ NISK800101,0.23,-0.26,-0.94,-1.13,1.78,-0.57,-0.75,-0.07,0.11,1.19,1.03,-1.05,0.66,0.48,-0.76,-0.67,-0.36,0.90,0.59,1.24
212
+ NISK860101,-0.22,-0.93,-2.65,-4.12,4.66,-2.76,-3.64,-1.62,1.28,5.58,5.01,-4.18,3.51,5.27,-3.03,-2.84,-1.20,5.20,2.15,4.45
213
+ NOZY710101,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1.8,1.8,0.0,1.3,2.5,0.0,0.0,0.4,3.4,2.3,1.5
214
+ OOBM770101,-1.895,-1.475,-1.560,-1.518,-2.035,-1.521,-1.535,-1.898,-1.755,-1.951,-1.966,-1.374,-1.963,-1.864,-1.699,-1.753,-1.767,-1.869,-1.686,-1.981
215
+ OOBM770102,-1.404,-0.921,-1.178,-1.162,-1.365,-1.116,-1.163,-1.364,-1.215,-1.189,-1.315,-1.074,-1.303,-1.135,-1.236,-1.297,-1.252,-1.030,-1.030,-1.254
216
+ OOBM770103,-0.491,-0.554,-0.382,-0.356,-0.670,-0.405,-0.371,-0.534,-0.540,-0.762,-0.650,-0.300,-0.659,-0.729,-0.463,-0.455,-0.515,-0.839,-0.656,-0.728
217
+ OOBM770104,-9.475,-16.225,-12.480,-12.144,-12.210,-13.689,-13.815,-7.592,-17.550,-15.608,-15.728,-12.366,-15.704,-20.504,-11.893,-10.518,-12.369,-26.166,-20.232,-13.867
218
+ OOBM770105,-7.020,-10.131,-9.424,-9.296,-8.190,-10.044,-10.467,-5.456,-12.150,-9.512,-10.520,-9.666,-10.424,-12.485,-8.652,-7.782,-8.764,-14.420,-12.360,-8.778
219
+ OOBM850101,2.01,0.84,0.03,-2.05,1.98,1.02,0.93,0.12,-0.14,3.70,2.73,2.55,1.75,2.68,0.41,1.47,2.39,2.49,2.23,3.50
220
+ OOBM850102,1.34,0.95,2.49,3.32,1.07,1.49,2.20,2.07,1.27,0.66,0.54,0.61,0.70,0.80,2.12,0.94,1.09,-4.65,-0.17,1.32
221
+ OOBM850103,0.46,-1.54,1.31,-0.33,0.20,-1.12,0.48,0.64,-1.31,3.28,0.43,-1.71,0.15,0.52,-0.58,-0.83,-1.52,1.25,-2.21,0.54
222
+ OOBM850104,-2.49,2.55,2.27,8.86,-3.13,1.79,4.04,-0.56,4.22,-10.87,-7.16,-9.97,-4.96,-6.64,5.19,-1.60,-4.75,-17.84,9.25,-3.97
223
+ OOBM850105,4.55,5.97,5.56,2.85,-0.78,4.15,5.16,9.14,4.48,2.10,3.24,10.68,2.18,4.37,5.14,6.78,8.60,1.97,2.40,3.81
224
+ PALJ810101,1.30,0.93,0.90,1.02,0.92,1.04,1.43,0.63,1.33,0.87,1.30,1.23,1.32,1.09,0.63,0.78,0.80,1.03,0.71,0.95
225
+ PALJ810102,1.32,1.04,0.74,0.97,0.70,1.25,1.48,0.59,1.06,1.01,1.22,1.13,1.47,1.10,0.57,0.77,0.86,1.02,0.72,1.05
226
+ PALJ810103,0.81,1.03,0.81,0.71,1.12,1.03,0.59,0.94,0.85,1.47,1.03,0.77,0.96,1.13,0.75,1.02,1.19,1.24,1.35,1.44
227
+ PALJ810104,0.90,0.75,0.82,0.75,1.12,0.95,0.44,0.83,0.86,1.59,1.24,0.75,0.94,1.41,0.46,0.70,1.20,1.28,1.45,1.73
228
+ PALJ810105,0.84,0.91,1.48,1.28,0.69,1.,0.78,1.76,0.53,0.55,0.49,0.95,0.52,0.88,1.47,1.29,1.05,0.88,1.28,0.51
229
+ PALJ810106,0.65,0.93,1.45,1.47,1.43,0.94,0.75,1.53,0.96,0.57,0.56,0.95,0.71,0.72,1.51,1.46,0.96,0.90,1.12,0.55
230
+ PALJ810107,1.08,0.93,1.05,0.86,1.22,0.95,1.09,0.85,1.02,0.98,1.04,1.01,1.11,0.96,0.91,0.95,1.15,1.17,0.80,1.03
231
+ PALJ810108,1.34,0.91,0.83,1.06,1.27,1.13,1.69,0.47,1.11,0.84,1.39,1.08,0.90,1.02,0.48,1.05,0.74,0.64,0.73,1.18
232
+ PALJ810109,1.15,1.06,0.87,1.,1.03,1.43,1.37,0.64,0.95,0.99,1.22,1.20,1.45,0.92,0.72,0.84,0.97,1.11,0.72,0.82
233
+ PALJ810110,0.89,1.06,0.67,0.71,1.04,1.06,0.72,0.87,1.04,1.14,1.02,1.,1.41,1.32,0.69,0.86,1.15,1.06,1.35,1.66
234
+ PALJ810111,0.82,0.99,1.27,0.98,0.71,1.01,0.54,0.94,1.26,1.67,0.94,0.73,1.30,1.56,0.69,0.65,0.98,1.25,1.26,1.22
235
+ PALJ810112,0.98,1.03,0.66,0.74,1.01,0.63,0.59,0.90,1.17,1.38,1.05,0.83,0.82,1.23,0.73,0.98,1.20,1.26,1.23,1.62
236
+ PALJ810113,0.69,0.,1.52,2.42,0.,1.44,0.63,2.64,0.22,0.43,0.,1.18,0.88,2.20,1.34,1.43,0.28,0.,1.53,0.14
237
+ PALJ810114,0.87,1.30,1.36,1.24,0.83,1.06,0.91,1.69,0.91,0.27,0.67,0.66,0.,0.47,1.54,1.08,1.12,1.24,0.54,0.69
238
+ PALJ810115,0.91,0.77,1.32,0.90,0.50,1.06,0.53,1.61,1.08,0.36,0.77,1.27,0.76,0.37,1.62,1.34,0.87,1.10,1.24,0.52
239
+ PALJ810116,0.92,0.90,1.57,1.22,0.62,0.66,0.92,1.61,0.39,0.79,0.50,0.86,0.50,0.96,1.30,1.40,1.11,0.57,1.78,0.50
240
+ PARJ860101,2.1,4.2,7.0,10.0,1.4,6.0,7.8,5.7,2.1,-8.0,-9.2,5.7,-4.2,-9.2,2.1,6.5,5.2,-10.0,-1.9,-3.7
241
+ PLIV810101,-2.89,-3.30,-3.41,-3.38,-2.49,-3.15,-2.94,-3.25,-2.84,-1.72,-1.61,-3.31,-1.84,-1.63,-2.50,-3.30,-2.91,-1.75,-2.42,-2.08
242
+ PONP800101,12.28,11.49,11.00,10.97,14.93,11.28,11.19,12.01,12.84,14.77,14.10,10.80,14.33,13.43,11.19,11.26,11.65,12.95,13.29,15.07
243
+ PONP800102,7.62,6.81,6.17,6.18,10.93,6.67,6.38,7.31,7.85,9.99,9.37,5.72,9.83,8.99,6.64,6.93,7.08,8.41,8.53,10.38
244
+ PONP800103,2.63,2.45,2.27,2.29,3.36,2.45,2.31,2.55,2.57,3.08,2.98,2.12,3.18,3.02,2.46,2.60,2.55,2.85,2.79,3.21
245
+ PONP800104,13.65,11.28,12.24,10.98,14.49,11.30,12.55,15.36,11.59,14.63,14.01,11.96,13.40,14.08,11.51,11.26,13.00,12.06,12.64,12.88
246
+ PONP800105,14.60,13.24,11.79,13.78,15.90,12.02,13.59,14.18,15.35,14.10,16.49,13.28,16.23,14.18,14.10,13.36,14.50,13.90,14.76,16.30
247
+ PONP800106,10.67,11.05,10.85,10.21,14.15,11.71,11.71,10.95,12.07,12.95,13.07,9.93,15.00,13.27,10.62,11.18,10.53,11.41,11.52,13.86
248
+ PONP800107,3.70,2.53,2.12,2.60,3.03,2.70,3.30,3.13,3.57,7.69,5.88,1.79,5.21,6.60,2.12,2.43,2.60,6.25,3.03,7.14
249
+ PONP800108,6.05,5.70,5.04,4.95,7.86,5.45,5.10,6.16,5.80,7.51,7.37,4.88,6.39,6.62,5.65,5.53,5.81,6.98,6.73,7.62
250
+ PRAM820101,0.305,0.227,0.322,0.335,0.339,0.306,0.282,0.352,0.215,0.278,0.262,0.391,0.280,0.195,0.346,0.326,0.251,0.291,0.293,0.291
251
+ PRAM820102,0.175,0.083,0.090,0.140,0.074,0.093,0.135,0.201,0.125,0.100,0.104,0.058,0.054,0.104,0.136,0.155,0.152,0.092,0.081,0.096
252
+ PRAM820103,0.687,0.590,0.489,0.632,0.263,0.527,0.669,0.670,0.594,0.564,0.541,0.407,0.328,0.577,0.600,0.692,0.713,0.632,0.495,0.529
253
+ PRAM900101,-6.70,51.50,20.10,38.50,-8.40,17.20,34.30,-4.20,12.60,-13.,-11.70,36.80,-14.20,-15.50,0.80,-2.50,-5.,-7.90,2.90,-10.90
254
+ PRAM900102,1.29,0.96,0.90,1.04,1.11,1.27,1.44,0.56,1.22,0.97,1.30,1.23,1.47,1.07,0.52,0.82,0.82,0.99,0.72,0.91
255
+ PRAM900103,0.90,0.99,0.76,0.72,0.74,0.80,0.75,0.92,1.08,1.45,1.02,0.77,0.97,1.32,0.64,0.95,1.21,1.14,1.25,1.49
256
+ PRAM900104,0.78,0.88,1.28,1.41,0.80,0.97,1.,1.64,0.69,0.51,0.59,0.96,0.39,0.58,1.91,1.33,1.03,0.75,1.05,0.47
257
+ PTIO830101,1.10,0.95,0.80,0.65,0.95,1.00,1.00,0.60,0.85,1.10,1.25,1.00,1.15,1.10,0.10,0.75,0.75,1.10,1.10,0.95
258
+ PTIO830102,1.00,0.70,0.60,0.50,1.90,1.00,0.70,0.30,0.80,4.00,2.00,0.70,1.90,3.10,0.20,0.90,1.70,2.20,2.80,4.00
259
+ QIAN880101,0.12,0.04,-0.10,0.01,-0.25,-0.03,-0.02,-0.02,-0.06,-0.07,0.05,0.26,0.00,0.05,-0.19,-0.19,-0.04,-0.06,-0.14,-0.03
260
+ QIAN880102,0.26,-0.14,-0.03,0.15,-0.15,-0.13,0.21,-0.37,0.10,-0.03,-0.02,0.12,0.00,0.12,-0.08,0.01,-0.34,-0.01,-0.29,0.02
261
+ QIAN880103,0.64,-0.10,0.09,0.33,0.03,-0.23,0.51,-0.09,-0.23,-0.22,0.41,-0.17,0.13,-0.03,-0.43,-0.10,-0.07,-0.02,-0.38,-0.01
262
+ QIAN880104,0.29,-0.03,-0.04,0.11,-0.05,0.26,0.28,-0.67,-0.26,0.00,0.47,-0.19,0.27,0.24,-0.34,-0.17,-0.20,0.25,-0.30,-0.01
263
+ QIAN880105,0.68,-0.22,-0.09,-0.02,-0.15,-0.15,0.44,-0.73,-0.14,-0.08,0.61,0.03,0.39,0.06,-0.76,-0.26,-0.10,0.20,-0.04,0.12
264
+ QIAN880106,0.34,0.22,-0.33,0.06,-0.18,0.01,0.20,-0.88,-0.09,-0.03,0.20,-0.11,0.43,0.15,-0.81,-0.35,-0.37,0.07,-0.31,0.13
265
+ QIAN880107,0.57,0.23,-0.36,-0.46,-0.15,0.15,0.26,-0.71,-0.05,0.00,0.48,0.16,0.41,0.03,-1.12,-0.47,-0.54,-0.10,-0.35,0.31
266
+ QIAN880108,0.33,0.10,-0.19,-0.44,-0.03,0.19,0.21,-0.46,0.27,-0.33,0.57,0.23,0.79,0.48,-1.86,-0.23,-0.33,0.15,-0.19,0.24
267
+ QIAN880109,0.13,0.08,-0.07,-0.71,-0.09,0.12,0.13,-0.39,0.32,0.00,0.50,0.37,0.63,0.15,-1.40,-0.28,-0.21,0.02,-0.10,0.17
268
+ QIAN880110,0.31,0.18,-0.10,-0.81,-0.26,0.41,-0.06,-0.42,0.51,-0.15,0.56,0.47,0.58,0.10,-1.33,-0.49,-0.44,0.14,-0.08,-0.01
269
+ QIAN880111,0.21,0.07,-0.04,-0.58,-0.12,0.13,-0.23,-0.15,0.37,0.31,0.70,0.28,0.61,-0.06,-1.03,-0.28,-0.25,0.21,0.16,0.00
270
+ QIAN880112,0.18,0.21,-0.03,-0.32,-0.29,-0.27,-0.25,-0.40,0.28,-0.03,0.62,0.41,0.21,0.05,-0.84,-0.05,-0.16,0.32,0.11,0.06
271
+ QIAN880113,-0.08,0.05,-0.08,-0.24,-0.25,-0.28,-0.19,-0.10,0.29,-0.01,0.28,0.45,0.11,0.00,-0.42,0.07,-0.33,0.36,0.00,-0.13
272
+ QIAN880114,-0.18,-0.13,0.28,0.05,-0.26,0.21,-0.06,0.23,0.24,-0.42,-0.23,0.03,-0.42,-0.18,-0.13,0.41,0.33,-0.10,-0.10,-0.07
273
+ QIAN880115,-0.01,0.02,0.41,-0.09,-0.27,0.01,0.09,0.13,0.22,-0.27,-0.25,0.08,-0.57,-0.12,0.26,0.44,0.35,-0.15,0.15,-0.09
274
+ QIAN880116,-0.19,0.03,0.02,-0.06,-0.29,0.02,-0.10,0.19,-0.16,-0.08,-0.42,-0.09,-0.38,-0.32,0.05,0.25,0.22,-0.19,0.05,-0.15
275
+ QIAN880117,-0.14,0.14,-0.27,-0.10,-0.64,-0.11,-0.39,0.46,-0.04,0.16,-0.57,0.04,0.24,0.08,0.02,-0.12,0.00,-0.10,0.18,0.29
276
+ QIAN880118,-0.31,0.25,-0.53,-0.54,-0.06,0.07,-0.52,0.37,-0.32,0.57,0.09,-0.29,0.29,0.24,-0.31,0.11,0.03,0.15,0.29,0.48
277
+ QIAN880119,-0.10,0.19,-0.89,-0.89,0.13,-0.04,-0.34,-0.45,-0.34,0.95,0.32,-0.46,0.43,0.36,-0.91,-0.12,0.49,0.34,0.42,0.76
278
+ QIAN880120,-0.25,-0.02,-0.77,-1.01,0.13,-0.12,-0.62,-0.72,-0.16,1.10,0.23,-0.59,0.32,0.48,-1.24,-0.31,0.17,0.45,0.77,0.69
279
+ QIAN880121,-0.26,-0.09,-0.34,-0.55,0.47,-0.33,-0.75,-0.56,-0.04,0.94,0.25,-0.55,-0.05,0.20,-1.28,-0.28,0.08,0.22,0.53,0.67
280
+ QIAN880122,0.05,-0.11,-0.40,-0.11,0.36,-0.67,-0.35,0.14,0.02,0.47,0.32,-0.51,-0.10,0.20,-0.79,0.03,-0.15,0.09,0.34,0.58
281
+ QIAN880123,-0.44,-0.13,0.05,-0.20,0.13,-0.58,-0.28,0.08,0.09,-0.04,-0.12,-0.33,-0.21,-0.13,-0.48,0.27,0.47,-0.22,-0.11,0.06
282
+ QIAN880124,-0.31,-0.10,0.06,0.13,-0.11,-0.47,-0.05,0.45,-0.06,-0.25,-0.44,-0.44,-0.28,-0.04,-0.29,0.34,0.27,-0.08,0.06,0.11
283
+ QIAN880125,-0.02,0.04,0.03,0.11,-0.02,-0.17,0.10,0.38,-0.09,-0.48,-0.26,-0.39,-0.14,-0.03,-0.04,0.41,0.36,-0.01,-0.08,-0.18
284
+ QIAN880126,-0.06,0.02,0.10,0.24,-0.19,-0.04,-0.04,0.17,0.19,-0.20,-0.46,-0.43,-0.52,-0.33,0.37,0.43,0.50,-0.32,0.35,0.00
285
+ QIAN880127,-0.05,0.06,0.00,0.15,0.30,-0.08,-0.02,-0.14,-0.07,0.26,0.04,-0.42,0.25,0.09,0.31,-0.11,-0.06,0.19,0.33,0.04
286
+ QIAN880128,-0.19,0.17,-0.38,0.09,0.41,0.04,-0.20,0.28,-0.19,-0.06,0.34,-0.20,0.45,0.07,0.04,-0.23,-0.02,0.16,0.22,0.05
287
+ QIAN880129,-0.43,0.06,0.00,-0.31,0.19,0.14,-0.41,-0.21,0.21,0.29,-0.10,0.33,-0.01,0.25,0.28,-0.23,-0.26,0.15,0.09,-0.10
288
+ QIAN880130,-0.19,-0.07,0.17,-0.27,0.42,-0.29,-0.22,0.17,0.17,-0.34,-0.22,0.00,-0.53,-0.31,0.14,0.22,0.10,-0.15,-0.02,-0.33
289
+ QIAN880131,-0.25,0.12,0.61,0.60,0.18,0.09,-0.12,0.09,0.42,-0.54,-0.55,0.14,-0.47,-0.29,0.89,0.24,0.16,-0.44,-0.19,-0.45
290
+ QIAN880132,-0.27,-0.40,0.71,0.54,0.00,-0.08,-0.12,1.14,0.18,-0.74,-0.54,0.45,-0.76,-0.47,1.40,0.40,-0.10,-0.46,-0.05,-0.86
291
+ QIAN880133,-0.42,-0.23,0.81,0.95,-0.18,-0.01,-0.09,1.24,0.05,-1.17,-0.69,0.09,-0.86,-0.39,1.77,0.63,0.29,-0.37,-0.41,-1.32
292
+ QIAN880134,-0.24,-0.04,0.45,0.65,-0.38,0.01,0.07,0.85,-0.21,-0.65,-0.80,0.17,-0.71,-0.61,2.27,0.33,0.13,-0.44,-0.49,-0.99
293
+ QIAN880135,-0.14,0.21,0.35,0.66,-0.09,0.11,0.06,0.36,-0.31,-0.51,-0.80,-0.14,-0.56,-0.25,1.59,0.32,0.21,-0.17,-0.35,-0.70
294
+ QIAN880136,0.01,-0.13,-0.11,0.78,-0.31,-0.13,0.09,0.14,-0.56,-0.09,-0.81,-0.43,-0.49,-0.20,1.14,0.13,-0.02,-0.20,0.10,-0.11
295
+ QIAN880137,-0.30,-0.09,-0.12,0.44,0.03,0.24,0.18,-0.12,-0.20,-0.07,-0.18,0.06,-0.44,0.11,0.77,-0.09,-0.27,-0.09,-0.25,-0.06
296
+ QIAN880138,-0.23,-0.20,0.06,0.34,0.19,0.47,0.28,0.14,-0.22,0.42,-0.36,-0.15,-0.19,-0.02,0.78,-0.29,-0.30,-0.18,0.07,0.29
297
+ QIAN880139,0.08,-0.01,-0.06,0.04,0.37,0.48,0.36,-0.02,-0.45,0.09,0.24,-0.27,0.16,0.34,0.16,-0.35,-0.04,-0.06,-0.20,0.18
298
+ RACS770101,0.934,0.962,0.986,0.994,0.900,1.047,0.986,1.015,0.882,0.766,0.825,1.040,0.804,0.773,1.047,1.056,1.008,0.848,0.931,0.825
299
+ RACS770102,0.941,1.112,1.038,1.071,0.866,1.150,1.100,1.055,0.911,0.742,0.798,1.232,0.781,0.723,1.093,1.082,1.043,0.867,1.050,0.817
300
+ RACS770103,1.16,1.72,1.97,2.66,0.50,3.87,2.40,1.63,0.86,0.57,0.51,3.90,0.40,0.43,2.04,1.61,1.48,0.75,1.72,0.59
301
+ RACS820101,0.85,2.02,0.88,1.50,0.90,1.71,1.79,1.54,1.59,0.67,1.03,0.88,1.17,0.85,1.47,1.50,1.96,0.83,1.34,0.89
302
+ RACS820102,1.58,1.14,0.77,0.98,1.04,1.24,1.49,0.66,0.99,1.09,1.21,1.27,1.41,1.00,1.46,1.05,0.87,1.23,0.68,0.88
303
+ RACS820103,0.82,2.60,2.07,2.64,0.00,0.00,2.62,1.63,0.00,2.32,0.00,2.86,0.00,0.00,0.00,1.23,2.48,0.00,1.90,1.62
304
+ RACS820104,0.78,1.75,1.32,1.25,3.14,0.93,0.94,1.13,1.03,1.26,0.91,0.85,0.41,1.07,1.73,1.31,1.57,0.98,1.31,1.11
305
+ RACS820105,0.88,0.99,1.02,1.16,1.14,0.93,1.01,0.70,1.87,1.61,1.09,0.83,1.71,1.52,0.87,1.14,0.96,1.96,1.68,1.56
306
+ RACS820106,0.30,0.90,2.73,1.26,0.72,0.97,1.33,3.09,1.33,0.45,0.96,0.71,1.89,1.20,0.83,1.16,0.97,1.58,0.86,0.64
307
+ RACS820107,0.40,1.20,1.24,1.59,2.98,0.50,1.26,1.89,2.71,1.31,0.57,0.87,0.00,1.27,0.38,0.92,1.38,1.53,1.79,0.95
308
+ RACS820108,1.48,1.02,0.99,1.19,0.86,1.42,1.43,0.46,1.27,1.12,1.33,1.36,1.41,1.30,0.25,0.89,0.81,1.27,0.91,0.93
309
+ RACS820109,0.00,0.00,4.14,2.15,0.00,0.00,0.00,6.49,0.00,0.00,0.00,0.00,0.00,2.11,1.99,0.00,1.24,0.00,1.90,0.00
310
+ RACS820110,1.02,1.00,1.31,1.76,1.05,1.05,0.83,2.39,0.40,0.83,1.06,0.94,1.33,0.41,2.73,1.18,0.77,1.22,1.09,0.88
311
+ RACS820111,0.93,1.52,0.92,0.60,1.08,0.94,0.73,0.78,1.08,1.74,1.03,1.00,1.31,1.51,1.37,0.97,1.38,1.12,1.65,1.70
312
+ RACS820112,0.99,1.19,1.15,1.18,2.32,1.52,1.36,1.40,1.06,0.81,1.26,0.91,1.00,1.25,0.00,1.50,1.18,1.33,1.09,1.01
313
+ RACS820113,17.05,21.25,34.81,19.27,28.84,15.42,20.12,38.14,23.07,16.66,10.89,16.46,20.61,16.26,23.94,19.95,18.92,23.36,26.49,17.06
314
+ RACS820114,14.53,17.82,13.59,19.78,30.57,22.18,18.19,37.16,22.63,20.28,14.30,14.07,20.61,19.61,52.63,18.56,21.09,19.78,26.36,21.87
315
+ RADA880101,1.81,-14.92,-6.64,-8.72,1.28,-5.54,-6.81,0.94,-4.66,4.92,4.92,-5.55,2.35,2.98,0.,-3.40,-2.57,2.33,-0.14,4.04
316
+ RADA880102,0.52,-1.32,-0.01,0.,0.,-0.07,-0.79,0.,0.95,2.04,1.76,0.08,1.32,2.09,0.,0.04,0.27,2.51,1.63,1.18
317
+ RADA880103,0.13,-5.,-3.04,-2.23,-2.52,-3.84,-3.43,1.45,-5.61,-2.77,-2.64,-3.97,-3.83,-3.74,0.,-1.66,-2.31,-8.21,-5.97,-2.05
318
+ RADA880104,1.29,-13.60,-6.63,0.,0.,-5.47,-6.02,0.94,-5.61,2.88,3.16,-5.63,1.03,0.89,0.,-3.44,-2.84,-0.18,-1.77,2.86
319
+ RADA880105,1.42,-18.60,-9.67,0.,0.,-9.31,-9.45,2.39,-11.22,0.11,0.52,-9.60,-2.80,-2.85,0.,-5.10,-5.15,-8.39,-7.74,0.81
320
+ RADA880106,93.7,250.4,146.3,142.6,135.2,177.7,182.9,52.6,188.1,182.2,173.7,215.2,197.6,228.6,0.,109.5,142.1,271.6,239.9,157.2
321
+ RADA880107,-0.29,-2.71,-1.18,-1.02,0.,-1.53,-0.90,-0.34,-0.94,0.24,-0.12,-2.05,-0.24,0.,0.,-0.75,-0.71,-0.59,-1.02,0.09
322
+ RADA880108,-0.06,-0.84,-0.48,-0.80,1.36,-0.73,-0.77,-0.41,0.49,1.31,1.21,-1.18,1.27,1.27,0.,-0.50,-0.27,0.88,0.33,1.09
323
+ RICJ880101,0.7,0.4,1.2,1.4,0.6,1.,1.,1.6,1.2,0.9,0.9,1.,0.3,1.2,0.7,1.6,0.3,1.1,1.9,0.7
324
+ RICJ880102,0.7,0.4,1.2,1.4,0.6,1.,1.,1.6,1.2,0.9,0.9,1.,0.3,1.2,0.7,1.6,0.3,1.1,1.9,0.7
325
+ RICJ880103,0.5,0.4,3.5,2.1,0.6,0.4,0.4,1.8,1.1,0.2,0.2,0.7,0.8,0.2,0.8,2.3,1.6,0.3,0.8,0.1
326
+ RICJ880104,1.2,0.7,0.7,0.8,0.8,0.7,2.2,0.3,0.7,0.9,0.9,0.6,0.3,0.5,2.6,0.7,0.8,2.1,1.8,1.1
327
+ RICJ880105,1.6,0.9,0.7,2.6,1.2,0.8,2.,0.9,0.7,0.7,0.3,1.,1.,0.9,0.5,0.8,0.7,1.7,0.4,0.6
328
+ RICJ880106,1.,0.4,0.7,2.2,0.6,1.5,3.3,0.6,0.7,0.4,0.6,0.8,1.,0.6,0.4,0.4,1.,1.4,1.2,1.1
329
+ RICJ880107,1.1,1.5,0.,0.3,1.1,1.3,0.5,0.4,1.5,1.1,2.6,0.8,1.7,1.9,0.1,0.4,0.5,3.1,0.6,1.5
330
+ RICJ880108,1.4,1.2,1.2,0.6,1.6,1.4,0.9,0.6,0.9,0.9,1.1,1.9,1.7,1.,0.3,1.1,0.6,1.4,0.2,0.8
331
+ RICJ880109,1.8,1.3,0.9,1.,0.7,1.3,0.8,0.5,1.,1.2,1.2,1.1,1.5,1.3,0.3,0.6,1.,1.5,0.8,1.2
332
+ RICJ880110,1.8,1.,0.6,0.7,0.,1.,1.1,0.5,2.4,1.3,1.2,1.4,2.7,1.9,0.3,0.5,0.5,1.1,1.3,0.4
333
+ RICJ880111,1.3,0.8,0.6,0.5,0.7,0.2,0.7,0.5,1.9,1.6,1.4,1.,2.8,2.9,0.,0.5,0.6,2.1,0.8,1.4
334
+ RICJ880112,0.7,0.8,0.8,0.6,0.2,1.3,1.6,0.1,1.1,1.4,1.9,2.2,1.,1.8,0.,0.6,0.7,0.4,1.1,1.3
335
+ RICJ880113,1.4,2.1,0.9,0.7,1.2,1.6,1.7,0.2,1.8,0.4,0.8,1.9,1.3,0.3,0.2,1.6,0.9,0.4,0.3,0.7
336
+ RICJ880114,1.1,1.,1.2,0.4,1.6,2.1,0.8,0.2,3.4,0.7,0.7,2.,1.,0.7,0.,1.7,1.,0.,1.2,0.7
337
+ RICJ880115,0.8,0.9,1.6,0.7,0.4,0.9,0.3,3.9,1.3,0.7,0.7,1.3,0.8,0.5,0.7,0.8,0.3,0.,0.8,0.2
338
+ RICJ880116,1.,1.4,0.9,1.4,0.8,1.4,0.8,1.2,1.2,1.1,0.9,1.2,0.8,0.1,1.9,0.7,0.8,0.4,0.9,0.6
339
+ RICJ880117,0.7,1.1,1.5,1.4,0.4,1.1,0.7,0.6,1.,0.7,0.5,1.3,0.,1.2,1.5,0.9,2.1,2.7,0.5,1.
340
+ ROBB760101,6.5,-0.9,-5.1,0.5,-1.3,1.0,7.8,-8.6,1.2,0.6,3.2,2.3,5.3,1.6,-7.7,-3.9,-2.6,1.2,-4.5,1.4
341
+ ROBB760102,2.3,-5.2,0.3,7.4,0.8,-0.7,10.3,-5.2,-2.8,-4.0,-2.1,-4.1,-3.5,-1.1,8.1,-3.5,2.3,-0.9,-3.7,-4.4
342
+ ROBB760103,6.7,0.3,-6.1,-3.1,-4.9,0.6,2.2,-6.8,-1.0,3.2,5.5,0.5,7.2,2.8,-22.8,-3.0,-4.0,4.0,-4.6,2.5
343
+ ROBB760104,2.3,1.4,-3.3,-4.4,6.1,2.7,2.5,-8.3,5.9,-0.5,0.1,7.3,3.5,1.6,-24.4,-1.9,-3.7,-0.9,-0.6,2.3
344
+ ROBB760105,-2.3,0.4,-4.1,-4.4,4.4,1.2,-5.0,-4.2,-2.5,6.7,2.3,-3.3,2.3,2.6,-1.8,-1.7,1.3,-1.0,4.0,6.8
345
+ ROBB760106,-2.7,0.4,-4.2,-4.4,3.7,0.8,-8.1,-3.9,-3.0,7.7,3.7,-2.9,3.7,3.0,-6.6,-2.4,1.7,0.3,3.3,7.1
346
+ ROBB760107,0.0,1.1,-2.0,-2.6,5.4,2.4,3.1,-3.4,0.8,-0.1,-3.7,-3.1,-2.1,0.7,7.4,1.3,0.0,-3.4,4.8,2.7
347
+ ROBB760108,-5.0,2.1,4.2,3.1,4.4,0.4,-4.7,5.7,-0.3,-4.6,-5.6,1.0,-4.8,-1.8,2.6,2.6,0.3,3.4,2.9,-6.0
348
+ ROBB760109,-3.3,0.0,5.4,3.9,-0.3,-0.4,-1.8,-1.2,3.0,-0.5,-2.3,-1.2,-4.3,0.8,6.5,1.8,-0.7,-0.8,3.1,-3.5
349
+ ROBB760110,-4.7,2.0,3.9,1.9,6.2,-2.0,-4.2,5.7,-2.6,-7.0,-6.2,2.8,-4.8,-3.7,3.6,2.1,0.6,3.3,3.8,-6.2
350
+ ROBB760111,-3.7,1.0,-0.6,-0.6,4.0,3.4,-4.3,5.9,-0.8,-0.5,-2.8,1.3,-1.6,1.6,-6.0,1.5,1.2,6.5,1.3,-4.6
351
+ ROBB760112,-2.5,-1.2,4.6,0.0,-4.7,-0.5,-4.4,4.9,1.6,-3.3,-2.0,-0.8,-4.1,-4.1,5.8,2.5,1.7,1.2,-0.6,-3.5
352
+ ROBB760113,-5.1,2.6,4.7,3.1,3.8,0.2,-5.2,5.6,-0.9,-4.5,-5.4,1.0,-5.3,-2.4,3.5,3.2,0.0,2.9,3.2,-6.3
353
+ ROBB790101,-1.0,0.3,-0.7,-1.2,2.1,-0.1,-0.7,0.3,1.1,4.0,2.0,-0.9,1.8,2.8,0.4,-1.2,-0.5,3.0,2.1,1.4
354
+ ROSG850101,86.6,162.2,103.3,97.8,132.3,119.2,113.9,62.9,155.8,158.0,164.1,115.5,172.9,194.1,92.9,85.6,106.5,224.6,177.7,141.0
355
+ ROSG850102,0.74,0.64,0.63,0.62,0.91,0.62,0.62,0.72,0.78,0.88,0.85,0.52,0.85,0.88,0.64,0.66,0.70,0.85,0.76,0.86
356
+ ROSM880101,-0.67,12.1,7.23,8.72,-0.34,6.39,7.35,0.00,3.82,-3.02,-3.02,6.13,-1.30,-3.24,-1.75,4.35,3.86,-2.86,0.98,-2.18
357
+ ROSM880102,-0.67,3.89,2.27,1.57,-2.00,2.12,1.78,0.00,1.09,-3.02,-3.02,2.46,-1.67,-3.24,-1.75,0.10,-0.42,-2.86,0.98,-2.18
358
+ ROSM880103,0.4,0.3,0.9,0.8,0.5,0.7,1.3,0.0,1.0,0.4,0.6,0.4,0.3,0.7,0.9,0.4,0.4,0.6,1.2,0.4
359
+ SIMZ760101,0.73,0.73,-0.01,0.54,0.70,-0.10,0.55,0.00,1.10,2.97,2.49,1.50,1.30,2.65,2.60,0.04,0.44,3.00,2.97,1.69
360
+ SNEP660101,0.239,0.211,0.249,0.171,0.220,0.260,0.187,0.160,0.205,0.273,0.281,0.228,0.253,0.234,0.165,0.236,0.213,0.183,0.193,0.255
361
+ SNEP660102,0.330,-0.176,-0.233,-0.371,0.074,-0.254,-0.409,0.370,-0.078,0.149,0.129,-0.075,-0.092,-0.011,0.370,0.022,0.136,-0.011,-0.138,0.245
362
+ SNEP660103,-0.110,0.079,-0.136,-0.285,-0.184,-0.067,-0.246,-0.073,0.320,0.001,-0.008,0.049,-0.041,0.438,-0.016,-0.153,-0.208,0.493,0.381,-0.155
363
+ SNEP660104,-0.062,-0.167,0.166,-0.079,0.380,-0.025,-0.184,-0.017,0.056,-0.309,-0.264,-0.371,0.077,0.074,-0.036,0.470,0.348,0.050,0.220,-0.212
364
+ SUEM840101,1.071,1.033,0.784,0.680,0.922,0.977,0.970,0.591,0.850,1.140,1.140,0.939,1.200,1.086,0.659,0.760,0.817,1.107,1.020,0.950
365
+ SUEM840102,8.0,0.1,0.1,70.0,26.0,33.0,6.0,0.1,0.1,55.0,33.0,1.0,54.0,18.0,42.0,0.1,0.1,77.0,66.0,0.1
366
+ SWER830101,-0.40,-0.59,-0.92,-1.31,0.17,-0.91,-1.22,-0.67,-0.64,1.25,1.22,-0.67,1.02,1.92,-0.49,-0.55,-0.28,0.50,1.67,0.91
367
+ TANS770101,1.42,1.06,0.71,1.01,0.73,1.02,1.63,0.50,1.20,1.12,1.29,1.24,1.21,1.16,0.65,0.71,0.78,1.05,0.67,0.99
368
+ TANS770102,0.946,1.128,0.432,1.311,0.481,1.615,0.698,0.360,2.168,1.283,1.192,1.203,0.000,0.963,2.093,0.523,1.961,1.925,0.802,0.409
369
+ TANS770103,0.790,1.087,0.832,0.530,1.268,1.038,0.643,0.725,0.864,1.361,1.111,0.735,1.092,1.052,1.249,1.093,1.214,1.114,1.340,1.428
370
+ TANS770104,1.194,0.795,0.659,1.056,0.678,1.290,0.928,1.015,0.611,0.603,0.595,1.060,0.831,0.377,3.159,1.444,1.172,0.452,0.816,0.640
371
+ TANS770105,0.497,0.677,2.072,1.498,1.348,0.711,0.651,1.848,1.474,0.471,0.656,0.932,0.425,1.348,0.179,1.151,0.749,1.283,1.283,0.654
372
+ TANS770106,0.937,1.725,1.080,1.640,1.004,1.078,0.679,0.901,1.085,0.178,0.808,1.254,0.886,0.803,0.748,1.145,1.487,0.803,1.227,0.625
373
+ TANS770107,0.289,1.380,3.169,0.917,1.767,2.372,0.285,4.259,1.061,0.262,0.000,1.288,0.000,0.393,0.000,0.160,0.218,0.000,0.654,0.167
374
+ TANS770108,0.328,2.088,1.498,3.379,0.000,0.000,0.000,0.500,1.204,2.078,0.414,0.835,0.982,1.336,0.415,1.089,1.732,1.781,0.000,0.946
375
+ TANS770109,0.945,0.364,1.202,1.315,0.932,0.704,1.014,2.355,0.525,0.673,0.758,0.947,1.028,0.622,0.579,1.140,0.863,0.777,0.907,0.561
376
+ TANS770110,0.842,0.936,1.352,1.366,1.032,0.998,0.758,1.349,1.079,0.459,0.665,1.045,0.668,0.881,1.385,1.257,1.055,0.881,1.101,0.643
377
+ VASM830101,0.135,0.296,0.196,0.289,0.159,0.236,0.184,0.051,0.223,0.173,0.215,0.170,0.239,0.087,0.151,0.010,0.100,0.166,0.066,0.285
378
+ VASM830102,0.507,0.459,0.287,0.223,0.592,0.383,0.445,0.390,0.310,0.111,0.619,0.559,0.431,0.077,0.739,0.689,0.785,0.160,0.060,0.356
379
+ VASM830103,0.159,0.194,0.385,0.283,0.187,0.236,0.206,0.049,0.233,0.581,0.083,0.159,0.198,0.682,0.366,0.150,0.074,0.463,0.737,0.301
380
+ VELV850101,.03731,.09593,.00359,.12630,.08292,.07606,.00580,.00499,.02415,.00000,.00000,.03710,.08226,.09460,.01979,.08292,.09408,.05481,.05159,.00569
381
+ VENT840101,0.,0.,0.,0.,0.,0.,0.,0.,0.,1.,1.,0.,0.,1.,0.,0.,0.,1.,1.,1.
382
+ VHEG790101,-12.04,39.23,4.25,23.22,3.95,2.16,16.81,-7.85,6.28,-18.32,-17.79,9.71,-8.86,-21.98,5.82,-1.54,-4.15,-16.19,-1.51,-16.22
383
+ WARP780101,10.04,6.18,5.63,5.76,8.89,5.41,5.37,7.99,7.49,8.72,8.79,4.40,9.15,7.98,7.79,7.08,7.00,8.07,6.90,8.88
384
+ WEBA780101,0.89,0.88,0.89,0.87,0.85,0.82,0.84,0.92,0.83,0.76,0.73,0.97,0.74,0.52,0.82,0.96,0.92,0.20,0.49,0.85
385
+ WERD780101,0.52,0.49,0.42,0.37,0.83,0.35,0.38,0.41,0.70,0.79,0.77,0.31,0.76,0.87,0.35,0.49,0.38,0.86,0.64,0.72
386
+ WERD780102,0.16,-0.20,1.03,-0.24,-0.12,-0.55,-0.45,-0.16,-0.18,-0.19,-0.44,-0.12,-0.79,-0.25,-0.59,-0.01,0.05,-0.33,-0.42,-0.46
387
+ WERD780103,0.15,-0.37,0.69,-0.22,-0.19,-0.06,0.14,0.36,-0.25,0.02,0.06,-0.16,0.11,1.18,0.11,0.13,0.28,-0.12,0.19,-0.08
388
+ WERD780104,-0.07,-0.40,-0.57,-0.80,0.17,-0.26,-0.63,0.27,-0.49,0.06,-0.17,-0.45,0.03,0.40,-0.47,-0.11,0.09,-0.61,-0.61,-0.11
389
+ WOEC730101,7.0,9.1,10.0,13.0,5.5,8.6,12.5,7.9,8.4,4.9,4.9,10.1,5.3,5.0,6.6,7.5,6.6,5.3,5.7,5.6
390
+ WOLR810101,1.94,-19.92,-9.68,-10.95,-1.24,-9.38,-10.20,2.39,-10.27,2.15,2.28,-9.52,-1.48,-0.76,-3.68,-5.06,-4.88,-5.88,-6.11,1.99
391
+ WOLS870101,0.07,2.88,3.22,3.64,0.71,2.18,3.08,2.23,2.41,-4.44,-4.19,2.84,-2.49,-4.92,-1.22,1.96,0.92,-4.75,-1.39,-2.69
392
+ WOLS870102,-1.73,2.52,1.45,1.13,-0.97,0.53,0.39,-5.36,1.74,-1.68,-1.03,1.41,-0.27,1.30,0.88,-1.63,-2.09,3.65,2.32,-2.53
393
+ WOLS870103,0.09,-3.44,0.84,2.36,4.13,-1.14,-0.07,0.30,1.11,-1.03,-0.98,-3.14,-0.41,0.45,2.23,0.57,-1.40,0.85,0.01,-1.29
394
+ YUTK870101,8.5,0.,8.2,8.5,11.0,6.3,8.8,7.1,10.1,16.8,15.0,7.9,13.3,11.2,8.2,7.4,8.8,9.9,8.8,12.0
395
+ YUTK870102,6.8,0.,6.2,7.0,8.3,8.5,4.9,6.4,9.2,10.0,12.2,7.5,8.4,8.3,6.9,8.0,7.0,5.7,6.8,9.4
396
+ YUTK870103,18.08,0.,17.47,17.36,18.17,17.93,18.16,18.24,18.49,18.62,18.60,17.96,18.11,17.30,18.16,17.57,17.54,17.19,17.99,18.30
397
+ YUTK870104,18.56,0.,18.24,17.94,17.84,18.51,17.97,18.57,18.64,19.21,19.01,18.36,18.49,17.95,18.77,18.06,17.71,16.87,18.23,18.98
398
+ ZASB820101,-0.152,-0.089,-0.203,-0.355,0.,-0.181,-0.411,-0.190,0.,-0.086,-0.102,-0.062,-0.107,0.001,-0.181,-0.203,-0.170,0.275,0.,-0.125
399
+ ZIMJ680101,0.83,0.83,0.09,0.64,1.48,0.00,0.65,0.10,1.10,3.07,2.52,1.60,1.40,2.75,2.70,0.14,0.54,0.31,2.97,1.79
400
+ ZIMJ680102,11.50,14.28,12.82,11.68,13.46,14.45,13.57,3.40,13.69,21.40,21.40,15.71,16.25,19.80,17.43,9.47,15.77,21.67,18.03,21.57
401
+ ZIMJ680103,0.00,52.00,3.38,49.70,1.48,3.53,49.90,0.00,51.60,0.13,0.13,49.50,1.43,0.35,1.58,1.67,1.66,2.10,1.61,0.13
402
+ ZIMJ680104,6.00,10.76,5.41,2.77,5.05,5.65,3.22,5.97,7.59,6.02,5.98,9.74,5.74,5.48,6.30,5.68,5.66,5.89,5.66,5.96
403
+ ZIMJ680105,9.9,4.6,5.4,2.8,2.8,9.0,3.2,5.6,8.2,17.1,17.6,3.5,14.9,18.8,14.8,6.9,9.5,17.1,15.0,14.3
404
+ AURR980101,0.94,1.15,0.79,1.19,0.60,0.94,1.41,1.18,1.15,1.07,0.95,1.03,0.88,1.06,1.18,0.69,0.87,0.91,1.04,0.90
405
+ AURR980102,0.98,1.14,1.05,1.05,0.41,0.90,1.04,1.25,1.01,0.88,0.80,1.06,1.12,1.12,1.31,1.02,0.80,0.90,1.12,0.87
406
+ AURR980103,1.05,0.81,0.91,1.39,0.60,0.87,1.11,1.26,1.43,0.95,0.96,0.97,0.99,0.95,1.05,0.96,1.03,1.06,0.94,0.62
407
+ AURR980104,0.75,0.90,1.24,1.72,0.66,1.08,1.10,1.14,0.96,0.80,1.01,0.66,1.02,0.88,1.33,1.20,1.13,0.68,0.80,0.58
408
+ AURR980105,0.67,0.76,1.28,1.58,0.37,1.05,0.94,0.98,0.83,0.78,0.79,0.84,0.98,0.96,1.12,1.25,1.41,0.94,0.82,0.67
409
+ AURR980106,1.10,1.05,0.72,1.14,0.26,1.31,2.30,0.55,0.83,1.06,0.84,1.08,0.90,0.90,1.67,0.81,0.77,1.26,0.99,0.76
410
+ AURR980107,1.39,0.95,0.67,1.64,0.52,1.60,2.07,0.65,1.36,0.64,0.91,0.80,1.10,1.00,0.94,0.69,0.92,1.10,0.73,0.70
411
+ AURR980108,1.43,1.33,0.55,0.90,0.52,1.43,1.70,0.56,0.66,1.18,1.52,0.82,1.68,1.10,0.15,0.61,0.75,1.68,0.65,1.14
412
+ AURR980109,1.55,1.39,0.60,0.61,0.59,1.43,1.34,0.37,0.89,1.47,1.36,1.27,2.13,1.39,0.03,0.44,0.65,1.10,0.93,1.18
413
+ AURR980110,1.80,1.73,0.73,0.90,0.55,0.97,1.73,0.32,0.46,1.09,1.47,1.24,1.64,0.96,0.15,0.67,0.70,0.68,0.91,0.81
414
+ AURR980111,1.52,1.49,0.58,1.04,0.26,1.41,1.76,0.30,0.83,1.25,1.26,1.10,1.14,1.14,0.44,0.66,0.73,0.68,1.04,1.03
415
+ AURR980112,1.49,1.41,0.67,0.94,0.37,1.52,1.55,0.29,0.96,1.04,1.40,1.17,1.84,0.86,0.20,0.68,0.79,1.52,1.06,0.94
416
+ AURR980113,1.73,1.24,0.70,0.68,0.63,0.88,1.16,0.32,0.76,1.15,1.80,1.22,2.21,1.35,0.07,0.65,0.46,1.57,1.10,0.94
417
+ AURR980114,1.33,1.39,0.64,0.60,0.44,1.37,1.43,0.20,1.02,1.58,1.63,1.71,1.76,1.22,0.07,0.42,0.57,1.00,1.02,1.08
418
+ AURR980115,1.87,1.66,0.70,0.91,0.33,1.24,1.88,0.33,0.89,0.90,1.65,1.63,1.35,0.67,0.03,0.71,0.50,1.00,0.73,0.51
419
+ AURR980116,1.19,1.45,1.33,0.72,0.44,1.43,1.27,0.74,1.55,0.61,1.36,1.45,1.35,1.20,0.10,1.02,0.82,0.58,1.06,0.46
420
+ AURR980117,0.77,1.11,1.39,0.79,0.44,0.95,0.92,2.74,1.65,0.64,0.66,1.19,0.74,1.04,0.66,0.64,0.82,0.58,0.93,0.53
421
+ AURR980118,0.93,0.96,0.82,1.15,0.67,1.02,1.07,1.08,1.40,1.14,1.16,1.27,1.11,1.05,1.01,0.71,0.84,1.06,1.15,0.74
422
+ AURR980119,1.09,1.29,1.03,1.17,0.26,1.08,1.31,0.97,0.88,0.97,0.87,1.13,0.96,0.84,2.01,0.76,0.79,0.91,0.64,0.77
423
+ AURR980120,0.71,1.09,0.95,1.43,0.65,0.87,1.19,1.07,1.13,1.05,0.84,1.10,0.80,0.95,1.70,0.65,.086,1.25,0.85,1.12
424
+ ONEK900101,13.4,13.3,12.0,11.7,11.6,12.8,12.2,11.3,11.6,12.0,13.0,13.0,12.8,12.1,6.5,12.2,11.7,12.4,12.1,11.9
425
+ ONEK900102,-0.77,-0.68,-0.07,-0.15,-0.23,-0.33,-0.27,0.00,-0.06,-0.23,-0.62,-0.65,-0.50,-0.41,3,-0.35,-0.11,-0.45,-0.17,-0.14
426
+ VINM940101,0.984,1.008,1.048,1.068,0.906,1.037,1.094,1.031,0.950,0.927,0.935,1.102,0.952,0.915,1.049,1.046,0.997,0.904,0.929,0.931
427
+ VINM940102,1.315,1.310,1.380,1.372,1.196,1.342,1.376,1.382,1.279,1.241,1.234,1.367,1.269,1.247,1.342,1.381,1.324,1.186,1.199,1.235
428
+ VINM940103,0.994,1.026,1.022,1.022,0.939,1.041,1.052,1.018,0.967,0.977,0.982,1.029,0.963,0.934,1.050,1.025,0.998,0.938,0.981,0.968
429
+ VINM940104,0.783,0.807,0.799,0.822,0.785,0.817,0.826,0.784,0.777,0.776,0.783,0.834,0.806,0.774,0.809,0.811,0.795,0.796,0.788,0.781
430
+ MUNV940101,0.423,0.503,0.906,0.870,0.877,0.594,0.167,1.162,0.802,0.566,0.494,0.615,0.444,0.706,1.945,0.928,0.884,0.690,0.778,0.706
431
+ MUNV940102,0.619,0.753,1.089,0.932,1.107,0.770,0.675,1.361,1.034,0.876,0.740,0.784,0.736,0.968,1.780,0.969,1.053,0.910,1.009,0.939
432
+ MUNV940103,1.080,0.976,1.197,1.266,0.733,1.050,1.085,1.104,0.906,0.583,0.789,1.026,0.812,0.685,1.412,0.987,0.784,0.755,0.665,0.546
433
+ MUNV940104,0.978,0.784,0.915,1.038,0.573,0.863,0.962,1.405,0.724,0.502,0.766,0.841,0.729,0.585,2.613,0.784,0.569,0.671,0.560,0.444
434
+ MUNV940105,1.40,1.23,1.61,1.89,1.14,1.33,1.42,2.06,1.25,1.02,1.33,1.34,1.12,1.07,3.90,1.20,0.99,1.10,0.98,0.87
435
+ WIMW960101,4.08,3.91,3.83,3.02,4.49,3.67,2.23,4.24,4.08,4.52,4.81,3.77,4.48,5.38,3.80,4.12,4.11,6.10,5.19,4.18
436
+ KIMC930101,-0.35,-0.44,-0.38,-0.41,-0.47,-0.40,-0.41,0.0,-0.46,-0.56,-0.48,-0.41,-0.46,-0.55,-0.23,-0.39,-0.48,-0.48,-0.50,-0.53
437
+ MONM990101,0.5,1.7,1.7,1.6,0.6,1.6,1.6,1.3,1.6,0.6,0.4,1.6,0.5,0.4,1.7,0.7,0.4,0.7,0.6,0.5
438
+ BLAM930101,0.96,0.77,0.39,0.42,0.42,0.80,0.53,0.00,0.57,0.84,0.92,0.73,0.86,0.59,-2.50,0.53,0.54,0.58,0.72,0.63
439
+ PARS000101,0.343,0.353,0.409,0.429,0.319,0.395,0.405,0.389,0.307,0.296,0.287,0.429,0.293,0.292,0.432,0.416,0.362,0.268,0.22,0.307
440
+ PARS000102,0.320,0.327,0.384,0.424,0.198,0.436,0.514,0.374,0.299,0.306,0.340,0.446,0.313,0.314,0.354,0.376,0.339,0.291,0.287,0.294
441
+ KUMS000101,8.9,4.6,4.4,6.3,0.6,2.8,6.9,9.4,2.2,7.0,7.4,6.1,2.3,3.3,4.2,4.0,5.7,1.3,4.5,8.2
442
+ KUMS000102,9.2,3.6,5.1,6.0,1.0,2.9,6.0,9.4,2.1,6.0,7.7,6.5,2.4,3.4,4.2,5.5,5.7,1.2,3.7,8.2
443
+ KUMS000103,14.1,5.5,3.2,5.7,0.1,3.7,8.8,4.1,2.0,7.1,9.1,7.7,3.3,5.0,0.7,3.9,4.4,1.2,4.5,5.9
444
+ KUMS000104,13.4,3.9,3.7,4.6,0.8,4.8,7.8,4.6,3.3,6.5,10.6,7.5,3.0,4.5,1.3,3.8,4.6,1.0,3.3,7.1
445
+ TAKK010101,9.8,7.3,3.6,4.9,3.0,2.4,4.4,0,11.9,17.2,17.0,10.5,11.9,23.0,15.0,2.6,6.9,24.2,17.2,15.3
446
+ FODM020101,0.70,0.95,1.47,0.87,1.17,0.73,0.96,0.64,1.39,1.29,1.44,0.91,0.91,1.34,0.12,0.84,0.74,1.80,1.68,1.20
447
+ NADH010101,58,-184,-93,-97,116,-139,-131,-11,-73,107,95,-24,78,92,-79,-34,-7,59,-11,100
448
+ NADH010102,51,-144,-84,-78,137,-128,-115,-13,-55,106,103,-205,73,108,-79,-26,-3,69,11,108
449
+ NADH010103,41,-109,-74,-47,169,-104,-90,-18,-35,104,103,-148,77,128,-81,-31,10,102,36,116
450
+ NADH010104,32,-95,-73,-29,182,-95,-74,-22,-25,106,104,-124,82,132,-82,-34,20,118,44,113
451
+ NADH010105,24,-79,-76,0,194,-87,-57,-28,-31,102,103,-9,90,131,-85,-36,34,116,43,111
452
+ NADH010106,5,-57,-77,45,224,-67,-8,-47,-50,83,82,-38,83,117,-103,-41,79,130,27,117
453
+ NADH010107,-2,-41,-97,248,329,-37,117,-66,-70,28,36,115,62,120,-132,-52,174,179,-7,114
454
+ MONM990201,0.4,1.5,1.6,1.5,0.7,1.4,1.3,1.1,1.4,0.5,0.3,1.4,0.5,0.3,1.6,0.9,0.7,0.9,0.9,0.4
455
+ KOEP990101,-0.04,-0.30,0.25,0.27,0.57,-0.02,-0.33,1.24,-0.11,-0.26,-0.38,-0.18,-0.09,-0.01,0.,0.15,0.39,0.21,0.05,-0.06
456
+ KOEP990102,-0.12,0.34,1.05,1.12,-0.63,1.67,0.91,0.76,1.34,-0.77,0.15,0.29,-0.71,-0.67,0.,1.45,-0.70,-0.14,-0.49,-0.70
457
+ CEDJ970101,8.6,4.2,4.6,4.9,2.9,4.0,5.1,7.8,2.1,4.6,8.8,6.3,2.5,3.7,4.9,7.3,6.0,1.4,3.6,6.7
458
+ CEDJ970102,7.6,5.0,4.4,5.2,2.2,4.1,6.2,6.9,2.1,5.1,9.4,5.8,2.1,4.0,5.4,7.2,6.1,1.4,3.2,6.7
459
+ CEDJ970103,8.1,4.6,3.7,3.8,2.0,3.1,4.6,7.0,2.0,6.7,11.0,4.4,2.8,5.6,4.7,7.3,5.6,1.8,3.3,7.7
460
+ CEDJ970104,7.9,4.9,4.0,5.5,1.9,4.4,7.1,7.1,2.1,5.2,8.6,6.7,2.4,3.9,5.3,6.6,5.3,1.2,3.1,6.8
461
+ CEDJ970105,8.3,8.7,3.7,4.7,1.6,4.7,6.5,6.3,2.1,3.7,7.4,7.9,2.3,2.7,6.9,8.8,5.1,0.7,2.4,5.3
462
+ FUKS010101,4.47,8.48,3.89,7.05,0.29,2.87,16.56,8.29,1.74,3.30,5.06,12.98,1.71,2.32,5.41,4.27,3.83,0.67,2.75,4.05
463
+ FUKS010102,6.77,6.87,5.50,8.57,0.31,5.24,12.93,7.95,2.80,2.72,4.43,10.20,1.87,1.92,4.79,5.41,5.36,0.54,2.26,3.57
464
+ FUKS010103,7.43,4.51,9.12,8.71,0.42,5.42,5.86,9.40,1.49,1.76,2.74,9.67,0.60,1.18,5.60,9.60,8.95,1.18,3.26,3.10
465
+ FUKS010104,5.22,7.30,6.06,7.91,1.01,6.00,10.66,5.81,2.27,2.36,4.52,12.68,1.85,1.68,5.70,6.99,5.16,0.56,2.16,4.10
466
+ FUKS010105,9.88,3.71,2.35,3.50,1.12,1.66,4.02,6.88,1.88,10.08,13.21,3.39,2.44,5.27,3.80,4.10,4.98,1.11,4.07,12.53
467
+ FUKS010106,10.98,3.26,2.85,3.37,1.47,2.30,3.51,7.48,2.20,9.74,12.79,2.54,3.10,4.97,3.42,4.93,5.55,1.28,3.55,10.69
468
+ FUKS010107,9.95,3.05,4.84,4.46,1.30,2.64,2.58,8.87,1.99,7.73,9.66,2.00,2.45,5.41,3.20,6.03,5.62,2.60,6.15,9.46
469
+ FUKS010108,8.26,2.80,2.54,2.80,2.67,2.86,2.67,5.62,1.98,8.95,16.46,1.89,2.67,7.32,3.30,6.00,5.00,2.01,3.96,10.24
470
+ FUKS010109,7.39,5.91,3.06,5.14,0.74,2.22,9.80,7.53,1.82,6.96,9.45,7.81,2.10,3.91,4.54,4.18,4.45,0.90,3.46,8.62
471
+ FUKS010110,9.07,4.90,4.05,5.73,0.95,3.63,7.77,7.69,2.47,6.56,9.00,6.01,2.54,3.59,4.04,5.15,5.46,0.95,2.96,7.47
472
+ FUKS010111,8.82,3.71,6.77,6.38,0.90,3.89,4.05,9.11,1.77,5.05,6.54,5.45,1.62,3.51,4.28,7.64,7.12,1.96,4.85,6.60
473
+ FUKS010112,6.65,5.17,4.40,5.50,1.79,4.52,6.89,5.72,2.13,5.47,10.15,7.59,2.24,4.34,4.56,6.52,5.08,1.24,3.01,7.00
474
+ AVBF000101,0.163,0.220,0.124,0.212,0.316,0.274,0.212,0.080,0.315,0.474,0.315,0.255,0.356,0.410,NA,0.290,0.412,0.325,0.354,0.515
475
+ AVBF000102,0.236,0.233,0.189,0.168,0.259,0.314,0.306,-0.170,0.256,0.391,0.293,0.231,0.367,0.328,NA,0.202,0.308,0.197,0.223,0.436
476
+ AVBF000103,-0.490,-0.429,-0.387,-0.375,-0.352,-0.422,-0.382,-0.647,-0.357,-0.268,-0.450,-0.409,-0.375,-0.309,NA,-0.426,-0.240,-0.325,-0.288,-0.220
477
+ AVBF000104,-0.871,-0.727,-0.741,-0.737,-0.666,-0.728,-0.773,-0.822,-0.685,-0.617,-0.798,-0.715,-0.717,-0.649,NA,-0.679,-0.629,-0.669,-0.655,-0.599
478
+ AVBF000105,-0.393,-0.317,-0.268,-0.247,-0.222,-0.291,-0.260,-0.570,-0.244,-0.144,-0.281,-0.294,-0.274,-0.189,NA,-0.280,-0.152,-0.206,-0.155,-0.080
479
+ AVBF000106,-0.378,-0.369,-0.245,-0.113,-0.206,-0.290,-0.165,-0.560,-0.295,-0.134,-0.266,-0.335,-0.260,-0.187,NA,-0.251,-0.093,-0.188,-0.147,-0.084
480
+ AVBF000107,-0.729,-0.535,-0.597,-0.545,-0.408,-0.492,-0.532,-0.860,-0.519,-0.361,-0.462,-0.508,-0.518,-0.454,NA,-0.278,-0.367,-0.455,-0.439,-0.323
481
+ AVBF000108,-0.623,-0.567,-0.619,-0.626,-0.571,-0.559,-0.572,-0.679,-0.508,-0.199,-0.527,-0.581,-0.571,-0.461,NA,-0.458,-0.233,-0.327,-0.451,-0.263
482
+ AVBF000109,-0.376,-0.280,-0.403,-0.405,-0.441,-0.362,-0.362,-0.392,-0.345,-0.194,-0.317,-0.412,-0.312,-0.237,NA,-0.374,-0.243,-0.111,-0.171,-0.355
483
+ YANJ020101,NA,0.62,0.76,0.66,0.83,0.59,0.73,NA,0.92,0.88,0.89,0.77,0.77,0.92,0.94,0.58,0.73,0.86,0.93,0.88
484
+ MITS020101,0,2.45,0,0,0,1.25,1.27,0,1.45,0,0,3.67,0,0,0,0,0,6.93,5.06,0
485
+ TSAJ990101,89.3,190.3,122.4,114.4,102.5,146.9,138.8,63.8,157.5,163.0,163.1,165.1,165.8,190.8,121.6,94.2,119.6,226.4,194.6,138.2
486
+ TSAJ990102,90.0,194.0,124.7,117.3,103.3,149.4,142.2,64.9,160.0,163.9,164.0,167.3,167.0,191.9,122.9,95.4,121.5,228.2,197.0,139.0
487
+ COSI940101,0.0373,0.0959,0.0036,0.1263,0.0829,0.0761,0.0058,0.0050,0.0242,0.0000,0.0000,0.0371,0.0823,0.0946,0.0198,0.0829,0.0941,0.0548,0.0516,0.0057
488
+ PONP930101,0.85,0.20,-0.48,-1.10,2.10,-0.42,-0.79,0,0.22,3.14,1.99,-1.19,1.42,1.69,-1.14,-0.52,-0.08,1.76,1.37,2.53
489
+ WILM950101,0.06,-0.85,0.25,-0.20,0.49,0.31,-0.10,0.21,-2.24,3.48,3.50,-1.62,0.21,4.80,0.71,-0.62,0.65,2.29,1.89,1.59
490
+ WILM950102,2.62,1.26,-1.27,-2.84,0.73,-1.69,-0.45,-1.15,-0.74,4.38,6.57,-2.78,-3.12,9.14,-0.12,-1.39,1.81,5.91,1.39,2.30
491
+ WILM950103,-1.64,-3.28,0.83,0.70,9.30,-0.04,1.18,-1.85,7.17,3.02,0.83,-2.36,4.26,-1.36,3.12,1.59,2.31,2.61,2.37,0.52
492
+ WILM950104,-2.34,1.60,2.81,-0.48,5.03,0.16,1.30,-1.06,-3.00,7.26,1.09,1.56,0.62,2.57,-0.15,1.93,0.19,3.59,-2.58,2.06
493
+ KUHL950101,0.78,1.58,1.20,1.35,0.55,1.19,1.45,0.68,0.99,0.47,0.56,1.10,0.66,0.47,0.69,1.00,1.05,0.70,1.00,0.51
494
+ GUOD860101,25,-7,-7,2,32,0,14,-2,-26,91,100,-26,68,100,25,-2,7,109,56,62
495
+ JURD980101,1.10,-5.10,-3.50,-3.60,2.50,-3.68,-3.20,-0.64,-3.20,4.50,3.80,-4.11,1.90,2.80,-1.90,-0.50,-0.70,-0.46,-1.3,4.2
496
+ BASU050101,0.1366,0.0363,-0.0345,-0.1233,0.2745,0.0325,-0.0484,-0.0464,0.0549,0.4172,0.4251,-0.0101,0.1747,0.4076,0.0019,-0.0433,0.0589,0.2362,0.3167,0.4084
497
+ BASU050102,0.0728,0.0394,-0.0390,-0.0552,0.3557,0.0126,-0.0295,-0.0589,0.0874,0.3805,0.3819,-0.0053,0.1613,0.4201,-0.0492,-0.0282,0.0239,0.4114,0.3113,0.2947
498
+ BASU050103,0.1510,-0.0103,0.0381,0.0047,0.3222,0.0246,-0.0639,0.0248,0.1335,0.4238,0.3926,-0.0158,0.2160,0.3455,0.0844,0.0040,0.1462,0.2657,0.2998,0.3997
499
+ SUYM030101,-0.058,0.000,0.027,0.016,0.447,-0.073,-0.128,0.331,0.195,0.060,0.138,-0.112,0.275,0.240,-0.478,-0.177,-0.163,0.564,0.322,-0.052
500
+ PUNT030101,-0.17,0.37,0.18,0.37,-0.06,0.26,0.15,0.01,-0.02,-0.28,-0.28,0.32,-0.26,-0.41,0.13,0.05,0.02,-0.15,-0.09,-0.17
501
+ PUNT030102,-0.15,0.32,0.22,0.41,-0.15,0.03,0.30,0.08,0.06,-0.29,-0.36,0.24,-0.19,-0.22,0.15,0.16,-0.08,-0.28,-0.03,-0.24
502
+ GEOR030101,0.964,1.143,0.944,0.916,0.778,1.047,1.051,0.835,1.014,0.922,1.085,0.944,1.032,1.119,1.299,0.947,1.017,0.895,1,0.955
503
+ GEOR030102,0.974,1.129,0.988,0.892,0.972,1.092,1.054,0.845,0.949,0.928,1.11,0.946,0.923,1.122,1.362,0.932,1.023,0.879,0.902,0.923
504
+ GEOR030103,0.938,1.137,0.902,0.857,0.6856,0.916,1.139,0.892,1.109,0.986,1,0.952,1.077,1.11,1.266,0.956,1.018,0.971,1.157,0.959
505
+ GEOR030104,1.042,1.069,0.828,0.97,0.5,1.111,0.992,0.743,1.034,0.852,1.193,0.979,0.998,0.981,1.332,0.984,0.992,0.96,1.12,1.001
506
+ GEOR030105,1.065,1.131,0.762,0.836,1.015,0.861,0.736,1.022,0.973,1.189,1.192,0.478,1.369,1.368,1.241,1.097,0.822,1.017,0.836,1.14
507
+ GEOR030106,0.99,1.132,0.873,0.915,0.644,0.999,1.053,0.785,1.054,0.95,1.106,1.003,1.093,1.121,1.314,0.911,0.988,0.939,1.09,0.957
508
+ GEOR030107,0.892,1.154,1.144,0.925,1.035,1.2,1.115,0.917,0.992,0.817,0.994,0.944,0.782,1.058,1.309,0.986,1.11,0.841,0.866,0.9
509
+ GEOR030108,1.092,1.239,0.927,0.919,0.662,1.124,1.199,0.698,1.012,0.912,1.276,1.008,1.171,1.09,0.8,0.886,0.832,0.981,1.075,0.908
510
+ GEOR030109,0.843,1.038,0.956,0.906,0.896,0.968,0.9,0.978,1.05,0.946,0.885,0.893,0.878,1.151,1.816,1.003,1.189,0.852,0.945,0.999
511
+ ZHOH040101,2.18,2.71,1.85,1.75,3.89,2.16,1.89,1.17,2.51,4.50,4.71,2.12,3.63,5.88,2.09,1.66,2.18,6.46,5.01,3.77
512
+ ZHOH040102,1.79,3.20,2.83,2.33,2.22,2.37,2.52,0.70,3.06,4.59,4.72,2.50,3.91,4.84,2.45,1.82,2.45,5.64,4.46,3.67
513
+ ZHOH040103,13.4,8.5,7.6,8.2,22.6,8.5,7.3,7.0,11.3,20.3,20.8,6.1,15.7,23.9,9.9,8.2,10.3,24.5,19.5,19.5
514
+ BAEK050101,0.0166,-0.0762,-0.0786,-0.1278,0.5724,-0.1051,-0.1794,-0.0442,0.1643,0.2758,0.2523,-0.2134,0.0197,0.3561,-0.4188,-0.1629,-0.0701,0.3836,0.2500,0.1782
515
+ HARY940101,90.1,192.8,127.5,117.1,113.2,149.4,140.8,63.8,159.3,164.9,164.6,170.0,167.7,193.5,123.1,94.2,120.0,197.1,231.7,139.1
516
+ PONJ960101,91.5,196.1,138.3,135.2,114.4,156.4,154.6,67.5,163.2,162.6,163.4,162.5,165.9,198.8,123.4,102.0,126.0,209.8,237.2,138.4
517
+ DIGM050101,1.076,1.361,1.056,1.290,0.753,0.729,1.118,1.346,0.985,0.926,1.054,1.105,0.974,0.869,0.820,1.342,0.871,0.666,0.531,1.131
518
+ WOLR790101,1.12,-2.55,-0.83,-0.83,0.59,-0.78,-0.92,1.20,-0.93,1.16,1.18,-0.80,0.55,0.67,0.54,-0.05,-0.02,-0.19,-0.23,1.13
519
+ OLSK800101,1.38,0.00,0.37,0.52,1.43,0.22,0.71,1.34,0.66,2.32,1.47,0.15,1.78,1.72,0.85,0.86,0.89,0.82,0.47,1.99
520
+ KIDA850101,-0.27,1.87,0.81,0.81,-1.05,1.10,1.17,-0.16,0.28,-0.77,-1.10,1.70,-0.73,-1.43,-0.75,0.42,0.63,-1.57,-0.56,-0.40
521
+ GUYH850102,0.05,0.12,0.29,0.41,-0.84,0.46,0.38,0.31,-0.41,-0.69,-0.62,0.57,-0.38,-0.45,0.46,0.12,0.38,-0.98,-0.25,-0.46
522
+ GUYH850103,0.54,-0.16,0.38,0.65,-1.13,0.05,0.38,NA,-0.59,-2.15,-1.08,0.48,-0.97,-1.51,-0.22,0.65,0.27,-1.61,-1.13,-0.75
523
+ GUYH850104,-0.31,1.30,0.49,0.58,-0.87,0.70,0.68,-0.33,0.13,-0.66,-0.53,1.79,-0.38,-0.45,0.34,0.10,0.21,-0.27,0.40,-0.62
524
+ GUYH850105,-0.27,2.00,0.61,0.50,-0.23,1.00,0.33,-0.22,0.37,-0.80,-0.44,1.17,-0.31,-0.55,0.36,0.17,0.18,0.05,0.48,-0.65
525
+ ROSM880104,0.39,NA,-1.91,-0.71,0.25,-1.30,-0.18,0.00,-0.60,1.82,1.82,0.32,0.96,2.27,NA,-1.24,-1.00,2.13,1.47,1.30
526
+ ROSM880105,0.39,-3.95,-1.91,-3.81,0.25,-1.30,-2.91,0.00,-0.64,1.82,1.82,-2.77,0.96,2.27,NA,-1.24,-1.00,2.13,1.47,1.30
527
+ JACR890101,0.18,-5.40,-1.30,-2.36,0.27,-1.22,-2.10,0.09,-1.48,0.37,0.41,-2.53,0.44,0.50,-0.20,-0.40,-0.34,-0.01,-0.08,0.32
528
+ COWR900101,0.42,-1.56,-1.03,-0.51,0.84,-0.96,-0.37,0.00,-2.28,1.81,1.80,-2.03,1.18,1.74,0.86,-0.64,-0.26,1.46,0.51,1.34
529
+ BLAS910101,0.616,0.000,0.236,0.028,0.680,0.251,0.043,0.501,0.165,0.943,0.943,0.283,0.738,1.000,0.711,0.359,0.450,0.878,0.880,0.825
530
+ CASG920101,0.2,-0.7,-0.5,-1.4,1.9,-1.1,-1.3,-0.1,0.4,1.4,0.5,-1.6,0.5,1.0,-1.0,-0.7,-0.4,1.6,0.5,0.7
531
+ CORJ870101,50.76,48.66,45.80,43.17,58.74,46.09,43.48,50.27,49.33,57.30,53.89,42.92,52.75,53.45,45.39,47.24,49.26,53.59,51.79,56.12
532
+ CORJ870102,-0.414,-0.584,-0.916,-1.310,0.162,-0.905,-1.218,-0.684,-0.630,1.237,1.215,-0.670,1.020,1.938,-0.503,-0.563,-0.289,0.514,1.699,0.899
533
+ CORJ870103,-0.96,0.75,-1.94,-5.68,4.54,-5.30,-3.86,-1.28,-0.62,5.54,6.81,-5.62,4.76,5.06,-4.47,-1.92,-3.99,0.21,3.34,5.39
534
+ CORJ870104,-0.26,0.08,-0.46,-1.30,0.83,-0.83,-0.73,-0.40,-0.18,1.10,1.52,-1.01,1.09,1.09,-0.62,-0.55,-0.71,-0.13,0.69,1.15
535
+ CORJ870105,-0.73,-1.03,-5.29,-6.13,0.64,-0.96,-2.90,-2.67,3.03,5.04,4.91,-5.99,3.34,5.20,-4.32,-3.00,-1.91,0.51,2.87,3.98
536
+ CORJ870106,-1.35,-3.89,-10.96,-11.88,4.37,-1.34,-4.56,-5.82,6.54,10.93,9.88,-11.92,7.47,11.35,-10.86,-6.21,-4.83,1.80,7.61,8.20
537
+ CORJ870107,-0.56,-0.26,-2.87,-4.31,1.78,-2.31,-2.35,-1.35,0.81,3.83,4.09,-4.08,3.11,3.67,-3.22,-1.85,-1.97,-0.11,2.17,3.31
538
+ CORJ870108,1.37,1.33,6.29,8.93,-4.47,3.88,4.04,3.39,-1.65,-7.92,-8.68,7.70,-7.13,-7.96,6.25,4.08,4.02,0.79,-4.73,-6.94
539
+ MIYS990101,-0.02,0.44,0.63,0.72,-0.96,0.56,0.74,0.38,0.00,-1.89,-2.29,1.01,-1.36,-2.22,0.47,0.55,0.25,-1.28,-0.88,-1.34
540
+ MIYS990102,0.00,0.07,0.10,0.12,-0.16,0.09,0.12,0.06,0.00,-0.31,-0.37,0.17,-0.22,-0.36,0.08,0.09,0.04,-0.21,-0.14,-0.22
541
+ MIYS990103,-0.03,0.09,0.13,0.17,-0.36,0.13,0.23,0.09,-0.04,-0.33,-0.38,0.32,-0.30,-0.34,0.20,0.10,0.01,-0.24,-0.23,-0.29
542
+ MIYS990104,-0.04,0.07,0.13,0.19,-0.38,0.14,0.23,0.09,-0.04,-0.34,-0.37,0.33,-0.30,-0.38,0.19,0.12,0.03,-0.33,-0.29,-0.29
543
+ MIYS990105,-0.02,0.08,0.10,0.19,-0.32,0.15,0.21,-0.02,-0.02,-0.28,-0.32,0.30,-0.25,-0.33,0.11,0.11,0.05,-0.27,-0.23,-0.23
544
+ ENGD860101,-1.6,12.3,4.8,9.2,-2.0,4.1,8.2,-1.0,3.0,-3.1,-2.8,8.8,-3.4,-3.7,0.2,-0.6,-1.2,-1.9,0.7,-2.6
545
+ FASG890101,-0.21,2.11,0.96,1.36,-6.04,1.52,2.30,0.00,-1.23,-4.81,-4.68,3.88,-3.66,-4.65,0.75,1.74,0.78,-3.32,-1.01,-3.50
546
+ KARS160101,2.00,8.00,5.00,5.00,3.00,6.00,6.00,1.00,7.00,5.00,5.00,6.00,5.00,8.00,4.00,3.00,4.00,11.00,9.00,4.00
547
+ KARS160102,1.00,7.00,4.00,4.00,2.00,5.00,5.00,0.00,6.00,4.00,4.00,5.00,4.00,8.00,4.00,2.00,3.00,12.00,9.00,3.00
548
+ KARS160103,2.00,12.00,8.00,8.00,4.00,10.00,10.00,0.00,14.00,8.00,8.00,10.00,8.00,14.00,8.00,4.00,6.00,24.00,18.00,6.00
549
+ KARS160104,1.00,6.00,4.00,4.00,2.00,4.00,5.00,1.00,6.000,4.00,4.00,4.00,4.00,6.00,4.00,2.00,3.00,8.00,7.00,3.00
550
+ KARS160105,1.00,8.120,5.00,5.17,2.33,5.860,6.00,0.00,6.71,3.25,5.00,7.00,5.40,7.00,4.00,1.670,3.250,11.10,8.88,3.25
551
+ KARS160106,1.00,6.00,3.00,3.00,1.00,4.00,4.00,0.00,6.000,3.00,3.00,5.00,3.00,6.000,4.00,2.00,1.00,9.000,6.000,1.00
552
+ KARS160107,1.00,12.00,6.00,6.00,3.00,8.00,8.00,0.00,9.00,6.00,6.00,9.00,7.00,11.000,4.000,3.00,4.00,14.000,13.000,4.00
553
+ KARS160108,1.00,1.50,1.60,1.60,1.333,1.667,1.667,0.00,2.00,1.600,1.60,1.667,1.60,1.750,2.00,1.333,1.50,2.182,2.000,1.50
554
+ KARS160109,2.00,12.499,11.539,11.539,6.243,12.207,11.530,0.00,12.876,10.851,11.029,10.363,9.49,14.851,12.00,5.00,9.928,13.511,12.868,9.928
555
+ KARS160110,0.00,-4.307,-4.178,-4.178,-2.243,-4.255,-3.425,0.00,-3.721,-6.085,-4.729,-3.151,-2.812,-4.801,-4.00,1.00,-3.928,-6.324,-4.793,-3.928
556
+ KARS160111,1.00,3.500,3.20,3.20,2.00,3.333,3.333,0.00,4.286,1.80,3.20,3.00,2.80,4.25,4.00,2.00,3.00,4.00,4.333,3.00
557
+ KARS160112,2.00,-2.590,0.528,0.528,2.00,-1.043,-0.538,0.00,-1.185,-1.517,1.052,-0.536,0.678,-1.672,4.00,2.00,3.00,-2.576,-2.054,3.00
558
+ KARS160113,6.00,19.00,12.00,12.00,6.00,12.00,12.00,1.00,15.00,12.00,12.00,12.00,18.00,18.00,12.00,6.00,6.00,24.00,18.00,6.00
559
+ KARS160114,6.00,31.444,16.50,16.40,16.670,21.167,21.00,3.50,23.10,15.60,15.60,24.50,27.20,23.25,12.00,13.33,12.40,27.50,27.78,10.50
560
+ KARS160115,6.00,20.00,14.00,12.00,12.00,15.00,14.00,1.00,18.00,12.00,12.00,18.00,18.00,18.00,12.00,8.00,8.00,18.00,20.00,6.00
561
+ KARS160116,6.00,38.00,20.00,20.00,22.00,24.00,26.00,6.00,31.00,18.00,18.00,31.00,34.00,24.00,12.00,20.00,14.00,36.00,38.00,12.00
562
+ KARS160117,12.00,45.00,33.007,34.00,28.00,39.00,40.00,7.00,47.00,30.00,30.00,37.00,40.00,48.00,24.00,22.00,27.00,68.00,56.00,24.007
563
+ KARS160118,6.00,5.00,6.60,6.80,9.33,6.50,6.67,3.50,4.70,6.00,6.00,6.17,8.00,6.00,6.00,7.33,5.40,5.667,6.22,6.00
564
+ KARS160119,12.00,23.343,27.708,28.634,28.00,27.831,28.731,7.00,24.243,24.841,25.021,22.739,31.344,26.993,24.00,20.00,23.819,29.778,28.252,24.00
565
+ KARS160120,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-1.734,-1.641,0.00,-0.179,0.00,0.00,0.00,0.00,-4.227,0.211,-0.96,0.00
566
+ KARS160121,6.00,10.667,10.00,10.40,11.333,10.50,10.667,3.50,10.400,9.60,9.60,10.167,13.60,12.00,12.00,8.667,9.00,12.75,12.222,9.00
567
+ KARS160122,0.00,4.20,3.00,2.969,6.00,1.849,1.822,0.00,1.605,3.373,3.113,1.372,2.656,2.026,12.00,6.00,6.00,2.044,1.599,6.00
apex/best_key_list ADDED
@@ -0,0 +1,8 @@
1
+ 3&128&2048&1e-05&0.1&1.0
2
+ 3&256&2048&1e-06&0.1&1.0
3
+ 2&128&512&1e-05&0.01&1.0
4
+ 3&128&512&1e-05&0.001&1.0
5
+ 2&128&2048&1e-06&0.0&1.0
6
+ 3&256&512&1e-06&0.0&1.0
7
+ 2&128&2048&1e-05&0.01&1.0
8
+ 2&256&2048&1e-06&0.1&1.0
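Each line in `apex/best_key_list` is an `&`-separated hyperparameter key; `apex/predict.py` below treats it as an opaque string when composing checkpoint filenames. A minimal sketch of that naming scheme (paths relative to the `apex/` directory, as used in `predict.py`):

```python
# Sketch of how apex/predict.py turns these keys into checkpoint paths.
repeat_num = 5  # same value used in apex/predict.py

with open('./best_key_list') as f:
    keys = [line.strip() for line in f if line.strip()]

checkpoint_paths = [
    f'./trained_models/trained_all_model_{key}_ensemble_{i}'
    for key in keys
    for i in range(repeat_num)
]
print(len(checkpoint_paths))  # 8 keys x 5 repeats = 40 ensemble members
```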
apex/predict.py ADDED
@@ -0,0 +1,129 @@
1
+ import os
2
+ import json
3
+ #from time import perf_counter
4
+ import numpy as np
5
+ import matplotlib.pyplot as plt
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ import torch.optim as optim
10
+ import math, copy, time
11
+ from torch.autograd import Variable
12
+ from scipy import stats
13
+ import pandas as pd
14
+ from sklearn.model_selection import KFold
15
+ import pickle
16
+ from sklearn.model_selection import train_test_split
17
+ from torch.optim.lr_scheduler import StepLR
18
+ import os.path
19
+ from Bio import SeqIO
20
+ import string
21
+ import glob
22
+ from sklearn.preprocessing import StandardScaler
23
+ from sklearn.linear_model import ElasticNet
24
+ from sklearn.svm import SVR
25
+ from sklearn.ensemble import RandomForestRegressor
26
+ from sklearn.model_selection import KFold, StratifiedKFold
27
+ from sklearn.metrics import roc_auc_score, average_precision_score
28
+ from sklearn.ensemble import RandomForestClassifier
29
+ from AMP_DL_model_twohead import AMP_model
30
+ #from propy.AAComposition import CalculateAADipeptideComposition
31
+ from rdkit import Chem
32
+ from rdkit.Chem import AllChem
33
+ from scipy import stats
34
+ from utils import *
35
+ from scipy import sparse
36
+ import sys
37
+ from optparse import OptionParser
38
+ import copy
39
+ import pandas as pd
40
+
41
+
42
+ col = ['E. coli ATCC11775', 'P. aeruginosa PAO1', 'P. aeruginosa PA14', 'S. aureus ATCC12600', 'E. coli AIG221', 'E. coli AIG222', 'K. pneumoniae ATCC13883', 'A. baumannii ATCC19606', 'A. muciniphila ATCC BAA-835', 'B. fragilis ATCC25285', 'B. vulgatus ATCC8482', 'C. aerofaciens ATCC25986', 'C. scindens ATCC35704', 'B. thetaiotaomicron ATCC29148', 'B. thetaiotaomicron Complemmented', 'B. thetaiotaomicron Mutant', 'B. uniformis ATCC8492', 'B. eggerthi ATCC27754', 'C. spiroforme ATCC29900', 'P. distasonis ATCC8503', 'P. copri DSMZ18205', 'B. ovatus ATCC8483', 'E. rectale ATCC33656', 'C. symbiosum', 'R. obeum', 'R. torques', 'S. aureus (ATCC BAA-1556) - MRSA', 'vancomycin-resistant E. faecalis ATCC700802', 'vancomycin-resistant E. faecium ATCC700221', 'E. coli Nissle', 'Salmonella enterica ATCC 9150 (BEIRES NR-515)', 'Salmonella enterica (BEIRES NR-170)', 'Salmonella enterica ATCC 9150 (BEIRES NR-174)', 'L. monocytogenes ATCC 19111 (BEIRES NR-106)']
43
+
44
+ max_len = 52 # maximum peptide length
45
+
46
+ word2idx, idx2word = make_vocab()
47
+ emb, AAindex_dict = AAindex('./aaindex1.csv', word2idx)
48
+ vocab_size = len(word2idx)
49
+ emb_size = np.shape(emb)[1]
50
+
51
+
52
+ model_num = 8
53
+ repeat_num = 5
54
+
55
+
56
+ f = open('./best_key_list', 'r')
57
+ lines = f.readlines()
58
+ f.close()
59
+
60
+ model_list = []
61
+ for line in lines:
62
+ parsed = line.strip('\n').strip('\r')
63
+ model_list.append(parsed)
64
+
65
+
66
+ all_list = []
67
+ ensemble_num = model_num * repeat_num
68
+
69
+ deep_model_list = []
70
+ for a_model_name in model_list:
71
+ for a_en in range(repeat_num):
72
+ key = 'trained_all_model_'+a_model_name+'_ensemble_'+str(a_en)
73
+
74
+ model = torch.load('./trained_models/'+key)
75
+ model.eval()
76
+ deep_model_list.append(model)
77
+
78
+
79
+
80
+
81
+
82
+
83
+ seq_list = []
84
+ f = open('./test_seqs.txt', 'r')
85
+ lines = f.readlines()
86
+ f.close()
87
+
88
+ for line in lines:
89
+ seq_list.append(line.strip('\n').strip('\r'))
90
+
91
+ seq_list = np.array(seq_list)
92
+
93
+ ensemble_counter = 0
94
+ for ensemble_id in range(ensemble_num):
95
+
96
+ amp_model = deep_model_list[ensemble_id].cuda().eval()  # renamed to avoid shadowing the imported AMP_model class
97
+
98
+ data_len = len(seq_list)
99
+ batch_size = 3000 #change according to your GPU memory
100
+ for i in range(int(math.ceil(data_len/float(batch_size)))):
101
+ if (i*batch_size) % 1000 == 0:
102
+ print ('progress', i*batch_size, data_len)
103
+
104
+ seq_batch = seq_list[i*batch_size:(i+1)*batch_size]
105
+ seq_rep, _, _ = onehot_encoding(seq_batch, max_len, word2idx)
106
+
107
+ X_seq = torch.LongTensor(seq_rep).cuda()
108
+
109
+
110
+ AMP_pred_batch = amp_model(X_seq).cpu().detach().numpy()
111
+ AMP_pred_batch = 10**(6-AMP_pred_batch) #transform back to MICs
112
+
113
+ if i == 0:
114
+ AMP_pred = AMP_pred_batch
115
+ else:
116
+ AMP_pred = np.vstack([AMP_pred, AMP_pred_batch])
117
+
118
+ if ensemble_id == 0:
119
+ AMP_sum = AMP_pred
120
+ else:
121
+ AMP_sum += AMP_pred
122
+ ensemble_counter += 1
123
+
124
+ AMP_pred = AMP_sum / float(ensemble_counter)
125
+
126
+ df = pd.DataFrame(data=AMP_pred, columns=col, index=seq_list)
127
+ print (df)
128
+
129
+ df.to_csv('Predicted_MICs.csv')
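`predict.py` averages the per-model predictions across all 40 ensemble members and converts the network output back to MIC values via `10**(6 - pred)`, which implies the models output `6 - log10(MIC)`. A minimal sketch of that post-processing step in isolation (array shapes are assumptions, not taken from the repo):

```python
import numpy as np

def ensemble_mic(per_model_preds):
    """Average ensemble outputs and undo the 6 - log10(MIC) transform.

    per_model_preds: list of (num_peptides, num_strains) arrays, one per
    ensemble member, in the model's log-space output (same convention as
    the `10**(6 - AMP_pred_batch)` line in apex/predict.py).
    """
    mics = [10.0 ** (6.0 - p) for p in per_model_preds]
    return np.mean(mics, axis=0)   # MIC estimates, averaged over the ensemble

# toy check: a raw prediction of 4.0 corresponds to MIC = 10**(6 - 4) = 100
print(ensemble_mic([np.array([[4.0]]), np.array([[4.0]])]))  # [[100.]]
```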
apex/requirement.txt ADDED
@@ -0,0 +1,7 @@
1
+ numpy==1.23
2
+ scipy==1.10
3
+ matplotlib==3.9.4
4
+ pandas==2.2.3
5
+ scikit-learn==1.6.1
6
+ biopython==1.85
7
+ rdkit==2024.3.2
apex/test_seqs.txt ADDED
@@ -0,0 +1,10 @@
1
+ IPKTYDKRWDDQCWLAITGRYHGITTPPCCSWVV
2
+ KWLIYYNEGHLMVKYMLTISVRIPEGDNPNIQLHGSIGSR
3
+ VGHAQVASPDLHWDGHGNHLIPWTPCYSHEMNPTMPPA
4
+ RIWETQGSDCIRDGIDSTGPPFMVMFHAAGWRQVHSK
5
+ IYEDYEFVRMPTHMTDFMQSPDQQNPKHMWTLCFDHT
6
+ CPWVQHFWAPPWAHCICIEGPEESGWATIEPMVVGT
7
+ FPLTMHGEFSQNLVWTITQHLVKRWCYTLSPKFCHRY
8
+ SRSEDQILATYWRTSTCYFNQLWFQRLTGQQRICC
9
+ QLELPCCIETWKLNVAFRCPFHKDLKRLGLYSRDKW
10
+ PPMDCVYAIKTTSDHQSTMFIIPRYTHMYGNLQLWCVYCT
apex/utils.py ADDED
@@ -0,0 +1,126 @@
1
+ import os
2
+ import json
3
+ import csv
4
+ import numpy as np
5
+ import matplotlib.pyplot as plt
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ import torch.optim as optim
10
+ import math, copy, time
11
+ from torch.autograd import Variable
12
+ from scipy import stats
13
+ import pandas as pd
14
+ from sklearn.model_selection import KFold
15
+ import pickle
16
+ from sklearn.model_selection import train_test_split
17
+ import os.path
18
+
19
+ def make_vocab():
20
+ #0: pad
21
+ #1: start
22
+ #2: end
23
+
24
+ word2idx = {}
25
+ idx2word = {}
26
+
27
+ word2idx['0'] = 0
28
+ word2idx['1'] = 1
29
+ word2idx['2'] = 2
30
+
31
+ word2idx['A'] = 3
32
+ word2idx['C'] = 4
33
+ word2idx['D'] = 5
34
+ word2idx['E'] = 6
35
+ word2idx['F'] = 7
36
+ word2idx['G'] = 8
37
+ word2idx['H'] = 9
38
+ word2idx['I'] = 10
39
+ word2idx['K'] = 11
40
+ word2idx['L'] = 12
41
+ word2idx['M'] = 13
42
+ word2idx['N'] = 14
43
+ word2idx['P'] = 15
44
+ word2idx['Q'] = 16
45
+ word2idx['R'] = 17
46
+ word2idx['S'] = 18
47
+ word2idx['T'] = 19
48
+ word2idx['V'] = 20
49
+ word2idx['W'] = 21
50
+ word2idx['Y'] = 22
51
+
52
+ for key, value in word2idx.items():
53
+ idx2word[value] = key
54
+
55
+ return word2idx, idx2word
56
+
57
+
58
+ def AAindex(path, word2idx):
59
+ with open(path) as csvfile:
60
+ reader = csv.reader(csvfile)
61
+ AAindex_dict = {}
62
+ AAindex_matrix = []
63
+ skip = 1
64
+ for row in reader:
65
+ if skip == 1:
66
+ skip = 0
67
+ header = np.array(row)[1:].tolist()
68
+ continue
69
+ tmp = []
70
+ for j in np.array(row)[1:]:
71
+ try:
72
+ tmp.append(float(j))
73
+ except:
74
+ tmp.append(0)
75
+ AAindex_matrix.append(np.array(tmp))
76
+
77
+ dim = np.shape(AAindex_matrix)[0]
78
+ AAindex_matrix = np.array(AAindex_matrix)
79
+ for i in range(len(header)):
80
+ AAindex_dict[header[i]] = AAindex_matrix[:, i]
81
+
82
+ #print (AAindex_matrix)
83
+ emb = np.zeros((len(word2idx), dim))
84
+ for key, value in word2idx.items():
85
+ if key in AAindex_dict:
86
+ emb[value] = AAindex_dict[key]
87
+ else:
88
+ pass
89
+ return emb, AAindex_dict
90
+
91
+
92
+
93
+ def onehot_encoding(seq_list_, max_len, word2idx):
94
+ #0: pad
95
+ #1: start
96
+ #2: end
97
+ seq_list = [i for i in seq_list_]
98
+ X = np.zeros((len(seq_list), max_len)).astype(int)
99
+
100
+ AA_mask = []
101
+ nonAA_mask = []
102
+
103
+ for i in range(len(seq_list)):
104
+ if len(seq_list[i]) >= max_len - 2:
105
+ a_seq = '1' + seq_list[i][:max_len-2].upper() + '2'
106
+ else:
107
+ a_seq = '1' + seq_list[i].upper() + '2'
108
+
109
+ if len(a_seq) > max_len:
110
+ iter_num = max_len
111
+ else:
112
+ iter_num = len(a_seq)
113
+
114
+ for j in range(iter_num):
115
+ if a_seq[j] not in word2idx:
116
+ continue
117
+ else:
118
+ X[i,j] = word2idx[a_seq[j]]
119
+
120
+ tmp = np.zeros(max_len)
121
+ tmp[1:iter_num+1] = 1
122
+ AA_mask.append(tmp.astype(int))
123
+ nonAA_mask.append((1-tmp).astype(int))
124
+
125
+
126
+ return np.array(X), np.array(AA_mask), np.array(nonAA_mask)
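`make_vocab` reserves indices 0/1/2 for the pad/start/end tokens, and `onehot_encoding` returns index-encoded sequences padded to `max_len` together with amino-acid masks. A small usage sketch on a toy peptide (assumes `apex/` is on the import path):

```python
from utils import make_vocab, onehot_encoding  # apex/utils.py

word2idx, idx2word = make_vocab()
X, aa_mask, non_aa_mask = onehot_encoding(['GIGKFLHSAK'], 52, word2idx)

print(X.shape)           # (1, 52): start token, residue indices, end token, zero padding
print(X[0, :13])         # [ 1  8 10  8 11  7 12  9 18  3 11  2  0]
print(aa_mask[0].sum())  # number of positions flagged as sequence content by the mask
```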
cfg_dataset.py ADDED
@@ -0,0 +1,324 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from torch.utils.data import Dataset, DataLoader
4
+ import numpy as np
5
+ import json
6
+ import os
7
+ from typing import Dict, List, Tuple, Optional
8
+ import random
9
+
10
+ class CFGUniProtDataset(Dataset):
11
+ """
12
+ Dataset class for UniProt sequences with classifier-free guidance.
13
+
14
+ This dataset:
15
+ 1. Loads processed UniProt data with AMP classifications
16
+ 2. Handles label masking for CFG training
17
+ 3. Integrates with your existing flow training pipeline
18
+ 4. Provides sequences, labels, and masking information
19
+ """
20
+
21
+ def __init__(self,
22
+ data_path: str,
23
+ use_masked_labels: bool = True,
24
+ mask_probability: float = 0.1,
25
+ max_seq_len: int = 50,
26
+ device: str = 'cuda'):
27
+
28
+ self.data_path = data_path
29
+ self.use_masked_labels = use_masked_labels
30
+ self.mask_probability = mask_probability
31
+ self.max_seq_len = max_seq_len
32
+ self.device = device
33
+
34
+ # Load processed data
35
+ self._load_data()
36
+
37
+ # Label mapping
38
+ self.label_map = {
39
+ 0: 'amp', # MIC < 100
40
+ 1: 'non_amp', # MIC >= 100
41
+ 2: 'mask' # Unknown MIC
42
+ }
43
+
44
+ print(f"CFG Dataset initialized:")
45
+ print(f" Total sequences: {len(self.sequences)}")
46
+ print(f" Using masked labels: {use_masked_labels}")
47
+ print(f" Mask probability: {mask_probability}")
48
+ print(f" Label distribution: {self._get_label_distribution()}")
49
+
50
+ def _load_data(self):
51
+ """Load processed UniProt data."""
52
+ if os.path.exists(self.data_path):
53
+ with open(self.data_path, 'r') as f:
54
+ data = json.load(f)
55
+
56
+ self.sequences = data['sequences']
57
+ self.original_labels = np.array(data['original_labels'])
58
+ self.masked_labels = np.array(data['masked_labels'])
59
+ self.mask_indices = set(data['mask_indices'])
60
+
61
+ else:
62
+ raise FileNotFoundError(f"Data file not found: {self.data_path}")
63
+
64
+ def _get_label_distribution(self) -> Dict[str, int]:
65
+ """Get distribution of labels in the dataset."""
66
+ labels = self.masked_labels if self.use_masked_labels else self.original_labels
67
+ unique, counts = np.unique(labels, return_counts=True)
68
+ return {self.label_map[label]: count for label, count in zip(unique, counts)}
69
+
70
+ def __len__(self) -> int:
71
+ return len(self.sequences)
72
+
73
+ def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
74
+ """Get a single sample with sequence and label."""
75
+ sequence = self.sequences[idx]
76
+
77
+ # Get appropriate label
78
+ if self.use_masked_labels:
79
+ label = self.masked_labels[idx]
80
+ else:
81
+ label = self.original_labels[idx]
82
+
83
+ # Check if this sample was masked
84
+ is_masked = idx in self.mask_indices
85
+
86
+ return {
87
+ 'sequence': sequence,
88
+ 'label': torch.tensor(label, dtype=torch.long),
89
+ 'original_label': torch.tensor(self.original_labels[idx], dtype=torch.long),
90
+ 'is_masked': torch.tensor(is_masked, dtype=torch.bool),
91
+ 'index': torch.tensor(idx, dtype=torch.long)
92
+ }
93
+
94
+ def get_label_statistics(self) -> Dict[str, Dict]:
95
+ """Get detailed statistics about labels."""
96
+ stats = {
97
+ 'original': {self.label_map[l]: int(c) for l, c in zip(*np.unique(self.original_labels, return_counts=True))},
98
+ 'masked': {self.label_map[l]: int(c) for l, c in zip(*np.unique(self.masked_labels, return_counts=True))} if self.use_masked_labels else None,
99
+ 'masking_info': {
100
+ 'total_masked': len(self.mask_indices),
101
+ 'mask_probability': self.mask_probability,
102
+ 'masked_indices': list(self.mask_indices)
103
+ }
104
+ }
105
+ return stats
106
+
107
+ class CFGFlowDataset(Dataset):
108
+ """
109
+ Dataset that integrates CFG labels with your existing flow training pipeline.
110
+
111
+ This dataset:
112
+ 1. Loads your existing AMP embeddings
113
+ 2. Adds CFG labels from UniProt processing
114
+ 3. Handles the integration between embeddings and labels
115
+ 4. Provides data in the format expected by your flow training
116
+ """
117
+
118
+ def __init__(self,
119
+ embeddings_path: str,
120
+ cfg_data_path: str,
121
+ use_masked_labels: bool = True,
122
+ max_seq_len: int = 50,
123
+ device: str = 'cuda'):
124
+
125
+ self.embeddings_path = embeddings_path
126
+ self.cfg_data_path = cfg_data_path
127
+ self.use_masked_labels = use_masked_labels
128
+ self.max_seq_len = max_seq_len
129
+ self.device = device
130
+
131
+ # Load data
132
+ self._load_embeddings()
133
+ self._load_cfg_data()
134
+ self._align_data()
135
+
136
+ print(f"CFG Flow Dataset initialized:")
137
+ print(f" AMP embeddings: {self.embeddings.shape}")
138
+ print(f" CFG labels: {len(self.cfg_labels)}")
139
+ print(f" Aligned samples: {len(self.aligned_indices)}")
140
+
141
+ def _load_embeddings(self):
142
+ """Load your existing AMP embeddings."""
143
+ print(f"Loading AMP embeddings from {self.embeddings_path}...")
144
+
145
+ # Try to load the combined embeddings file first (FULL DATA)
146
+ combined_path = os.path.join(self.embeddings_path, "all_peptide_embeddings.pt")
147
+
148
+ if os.path.exists(combined_path):
149
+ print(f"Loading combined embeddings from {combined_path} (FULL DATA)...")
150
+ # Load on CPU first to avoid CUDA issues with DataLoader workers
151
+ self.embeddings = torch.load(combined_path, map_location='cpu')
152
+ print(f"✓ Loaded ALL embeddings: {self.embeddings.shape}")
153
+ else:
154
+ print("Combined embeddings file not found, loading individual files...")
155
+ # Fallback to individual files
156
+ import glob
157
+
158
+ embedding_files = glob.glob(os.path.join(self.embeddings_path, "*.pt"))
159
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json') and not f.endswith('all_peptide_embeddings.pt')]
160
+
161
+ print(f"Found {len(embedding_files)} individual embedding files")
162
+
163
+ # Load and stack all embeddings
164
+ embeddings_list = []
165
+ for file_path in embedding_files:
166
+ try:
167
+ embedding = torch.load(file_path, map_location='cpu')
168
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
169
+ embeddings_list.append(embedding)
170
+ else:
171
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
172
+ except Exception as e:
173
+ print(f"Warning: Could not load {file_path}: {e}")
174
+
175
+ if not embeddings_list:
176
+ raise ValueError("No valid embeddings found!")
177
+
178
+ self.embeddings = torch.stack(embeddings_list)
179
+ print(f"Loaded {len(self.embeddings)} embeddings from individual files")
180
+
181
+ def _load_cfg_data(self):
182
+ """Load CFG data from UniProt processing."""
183
+ print(f"Loading CFG data from {self.cfg_data_path}...")
184
+ with open(self.cfg_data_path, 'r') as f:
185
+ cfg_data = json.load(f)
186
+
187
+ self.cfg_sequences = cfg_data['sequences']
188
+ self.cfg_original_labels = np.array(cfg_data['labels'])
189
+
190
+ # For CFG training, we need to create masked labels
191
+ # Randomly mask 10% of labels for CFG training
192
+ self.cfg_masked_labels = self.cfg_original_labels.copy()
193
+ mask_probability = 0.1
194
+ mask_indices = np.random.choice(
195
+ len(self.cfg_original_labels),
196
+ size=int(len(self.cfg_original_labels) * mask_probability),
197
+ replace=False
198
+ )
199
+ self.cfg_masked_labels[mask_indices] = 2 # 2 = mask/unknown
200
+ self.cfg_mask_indices = set(mask_indices)
201
+
202
+ print(f"Loaded {len(self.cfg_sequences)} CFG sequences")
203
+ print(f"Label distribution: {np.bincount(self.cfg_original_labels)}")
204
+ print(f"Masked {len(self.cfg_mask_indices)} labels for CFG training")
205
+
206
+ def _align_data(self):
207
+ """Align AMP embeddings with CFG data based on sequence matching."""
208
+ print("Aligning AMP embeddings with CFG data...")
209
+
210
+ # For now, we'll use a simple approach: take the first N sequences
211
+ # where N is the minimum of embeddings and CFG data
212
+ min_samples = min(len(self.embeddings), len(self.cfg_sequences))
213
+
214
+ self.aligned_indices = list(range(min_samples))
215
+
216
+ # Align labels
217
+ if self.use_masked_labels:
218
+ self.cfg_labels = self.cfg_masked_labels[:min_samples]
219
+ else:
220
+ self.cfg_labels = self.cfg_original_labels[:min_samples]
221
+
222
+ # Align embeddings
223
+ self.aligned_embeddings = self.embeddings[:min_samples]
224
+
225
+ print(f"Aligned {min_samples} samples")
226
+
227
+ def __len__(self) -> int:
228
+ return len(self.aligned_indices)
229
+
230
+ def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
231
+ """Get a single sample with embedding and CFG label."""
232
+ # Embeddings are already on CPU
233
+ embedding = self.aligned_embeddings[idx]
234
+ label = self.cfg_labels[idx]
235
+ original_label = self.cfg_original_labels[idx]
236
+ is_masked = idx in self.cfg_mask_indices
237
+
238
+ return {
239
+ 'embedding': embedding,
240
+ 'label': torch.tensor(label, dtype=torch.long),
241
+ 'original_label': torch.tensor(original_label, dtype=torch.long),
242
+ 'is_masked': torch.tensor(is_masked, dtype=torch.bool),
243
+ 'index': torch.tensor(idx, dtype=torch.long)
244
+ }
245
+
246
+ def get_embedding_stats(self) -> Dict:
247
+ """Get statistics about the embeddings."""
248
+ return {
249
+ 'shape': self.aligned_embeddings.shape,
250
+ 'mean': self.aligned_embeddings.mean().item(),
251
+ 'std': self.aligned_embeddings.std().item(),
252
+ 'min': self.aligned_embeddings.min().item(),
253
+ 'max': self.aligned_embeddings.max().item()
254
+ }
255
+
256
+ def create_cfg_dataloader(dataset: Dataset,
257
+ batch_size: int = 32,
258
+ shuffle: bool = True,
259
+ num_workers: int = 4) -> DataLoader:
260
+ """Create a DataLoader for CFG training."""
261
+
262
+ def collate_fn(batch):
263
+ """Custom collate function for CFG data."""
264
+ # Separate different types of data
265
+ embeddings = torch.stack([item['embedding'] for item in batch])
266
+ labels = torch.stack([item['label'] for item in batch])
267
+ original_labels = torch.stack([item['original_label'] for item in batch])
268
+ is_masked = torch.stack([item['is_masked'] for item in batch])
269
+ indices = torch.stack([item['index'] for item in batch])
270
+
271
+ return {
272
+ 'embeddings': embeddings,
273
+ 'labels': labels,
274
+ 'original_labels': original_labels,
275
+ 'is_masked': is_masked,
276
+ 'indices': indices
277
+ }
278
+
279
+ return DataLoader(
280
+ dataset,
281
+ batch_size=batch_size,
282
+ shuffle=shuffle,
283
+ num_workers=num_workers,
284
+ collate_fn=collate_fn,
285
+ pin_memory=True
286
+ )
287
+
288
+ def test_cfg_dataset():
289
+ """Test function to verify the CFG dataset works correctly."""
290
+ print("Testing CFG Dataset...")
291
+
292
+ # Test with a small subset
293
+ test_data = {
294
+ 'sequences': ['MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
295
+ 'MKLLIVTFCLTFAAL',
296
+ 'MKLLIVTFCLTFAALMKLLIVTFCLTFAAL'],
297
+ 'original_labels': [0, 1, 0], # amp, non_amp, amp
298
+ 'masked_labels': [0, 2, 0], # amp, mask, amp
299
+ 'mask_indices': [1] # Only second sequence is masked
300
+ }
301
+
302
+ # Save test data
303
+ test_path = 'test_cfg_data.json'
304
+ with open(test_path, 'w') as f:
305
+ json.dump(test_data, f)
306
+
307
+ # Test dataset
308
+ dataset = CFGUniProtDataset(test_path, use_masked_labels=True)
309
+
310
+ print(f"Dataset length: {len(dataset)}")
311
+ for i in range(len(dataset)):
312
+ sample = dataset[i]
313
+ print(f"Sample {i}:")
314
+ print(f" Sequence: {sample['sequence'][:20]}...")
315
+ print(f" Label: {sample['label'].item()}")
316
+ print(f" Original Label: {sample['original_label'].item()}")
317
+ print(f" Is Masked: {sample['is_masked'].item()}")
318
+
319
+ # Clean up
320
+ os.remove(test_path)
321
+ print("Test completed successfully!")
322
+
323
+ if __name__ == "__main__":
324
+ test_cfg_dataset()
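A minimal sketch of wiring `CFGFlowDataset` into training via `create_cfg_dataloader`; the two paths are placeholders for wherever the precomputed ESM-2 embeddings and the processed UniProt JSON actually live:

```python
from cfg_dataset import CFGFlowDataset, create_cfg_dataloader

# Placeholder paths - point these at your own embedding directory and CFG JSON.
dataset = CFGFlowDataset(
    embeddings_path='path/to/peptide_embeddings_dir',
    cfg_data_path='path/to/uniprot_cfg_data.json',
    use_masked_labels=True,   # ~10% of labels are replaced with 2 (= mask) for CFG
)
loader = create_cfg_dataloader(dataset, batch_size=32, shuffle=True, num_workers=4)

batch = next(iter(loader))
print(batch['embeddings'].shape)  # (32, seq_len, 1280) precomputed ESM-2 embeddings
print(batch['labels'][:8])        # 0 = AMP, 1 = non-AMP, 2 = masked/unknown
```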
compressor_with_embeddings.py ADDED
@@ -0,0 +1,278 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+ from torch.utils.data import Dataset, DataLoader
5
+ from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
6
+ import json
7
+ import numpy as np
8
+ from tqdm import tqdm
9
+
10
+ # ---------------- Hyperparameters ----------------
11
+ ESM_DIM = 1280 # ESM-2 hidden dim (esm2_t33_650M_UR50D)
12
+ COMP_RATIO = 16 # compression factor
13
+ COMP_DIM = ESM_DIM // COMP_RATIO
14
+ MAX_SEQ_LEN = 50 # Actual sequence length from final_sequence_encoder.py
15
+ BATCH_SIZE = 32
16
+ EPOCHS = 30
17
+ BASE_LR = 1e-3 # initial learning rate
18
+ LR_MIN = 8e-5 # minimum learning rate for cosine schedule
19
+ WARMUP_STEPS = 10_000
20
+ DEPTH = 4 # total transformer layers (2 pre-pool, 2 post-pool)
21
+ HEADS = 8 # attention heads
22
+ DIM_FF = ESM_DIM * 4
23
+ POOLING = True # enforce ProtFlow hourglass pooling
24
+
25
+ # ---------------- Dataset for Pre-computed Embeddings ----------------
26
+ class PrecomputedEmbeddingDataset(Dataset):
27
+ def __init__(self, embeddings_path):
28
+ """
29
+ Load pre-computed embeddings from the final_sequence_encoder.py output.
30
+ Args:
31
+ embeddings_path: Path to the directory containing individual .pt embedding files
32
+ """
33
+ print(f"Loading pre-computed embeddings from {embeddings_path}...")
34
+
35
+ # Load all individual embedding files
36
+ import glob
37
+ import os
38
+
39
+ embedding_files = glob.glob(os.path.join(embeddings_path, "*.pt"))
40
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json')]
41
+
42
+ print(f"Found {len(embedding_files)} embedding files")
43
+
44
+ # Load and stack all embeddings
45
+ embeddings_list = []
46
+ for file_path in embedding_files:
47
+ try:
48
+ embedding = torch.load(file_path)
49
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
50
+ embeddings_list.append(embedding)
51
+ else:
52
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
53
+ except Exception as e:
54
+ print(f"Warning: Could not load {file_path}: {e}")
55
+
56
+ if not embeddings_list:
57
+ raise ValueError("No valid embeddings found!")
58
+
59
+ self.embeddings = torch.stack(embeddings_list)
60
+ print(f"Loaded {len(self.embeddings)} embeddings with shape {self.embeddings.shape}")
61
+
62
+ # Ensure embeddings are the right shape
63
+ if len(self.embeddings.shape) != 3:
64
+ raise ValueError(f"Expected 3D tensor, got shape {self.embeddings.shape}")
65
+
66
+ if self.embeddings.shape[1] != MAX_SEQ_LEN:
67
+ print(f"Warning: Expected sequence length {MAX_SEQ_LEN}, got {self.embeddings.shape[1]}")
68
+
69
+ if self.embeddings.shape[2] != ESM_DIM:
70
+ print(f"Warning: Expected embedding dim {ESM_DIM}, got {self.embeddings.shape[2]}")
71
+
72
+ def __len__(self):
73
+ return len(self.embeddings)
74
+
75
+ def __getitem__(self, idx):
76
+ return self.embeddings[idx]
77
+
78
+ # ---------------- Compressor ----------------
79
+ class Compressor(nn.Module):
80
+ def __init__(self, in_dim=ESM_DIM, out_dim=COMP_DIM):
81
+ super().__init__()
82
+ self.norm = nn.LayerNorm(in_dim)
83
+ layer = lambda: nn.TransformerEncoderLayer(
84
+ d_model=in_dim, nhead=HEADS, dim_feedforward=DIM_FF,
85
+ batch_first=True)
86
+ # two layers before pool, two after
87
+ self.pre_tr = nn.TransformerEncoder(layer(), num_layers=DEPTH//2)
88
+ self.post_tr = nn.TransformerEncoder(layer(), num_layers=DEPTH//2)
89
+ self.proj = nn.Sequential(
90
+ nn.LayerNorm(in_dim),
91
+ nn.Linear(in_dim, out_dim),
92
+ nn.Tanh()
93
+ )
94
+ self.pooling = POOLING
95
+
96
+ def forward(self, x, stats=None):
97
+ if stats:
98
+ m, s, mn, mx = stats['mean'], stats['std'], stats['min'], stats['max']
99
+ # Move stats to the same device as x
100
+ m = m.to(x.device)
101
+ s = s.to(x.device)
102
+ mn = mn.to(x.device)
103
+ mx = mx.to(x.device)
104
+ x = torch.clamp((x - m) / s, -4, 4)
105
+ x = torch.clamp((x - mn) / (mx - mn + 1e-8), 0, 1)
106
+ x = self.norm(x)
107
+ x = self.pre_tr(x) # [B, L, D]
108
+ if self.pooling:
109
+ B, L, D = x.shape
110
+ if L % 2: x = x[:, :-1, :]
111
+ x = x.view(B, L//2, 2, D).mean(2) # halve sequence length
112
+ x = self.post_tr(x) # [B, L' , D]
113
+ return self.proj(x) # [B, L', COMP_DIM]
114
+
115
+ # ---------------- Decompressor ----------------
116
+ class Decompressor(nn.Module):
117
+ def __init__(self, in_dim=COMP_DIM, out_dim=ESM_DIM):
118
+ super().__init__()
119
+ self.proj = nn.Sequential(
120
+ nn.LayerNorm(in_dim),
121
+ nn.Linear(in_dim, out_dim)
122
+ )
123
+ layer = lambda: nn.TransformerEncoderLayer(
124
+ d_model=out_dim, nhead=HEADS, dim_feedforward=DIM_FF,
125
+ batch_first=True)
126
+ self.decoder = nn.TransformerEncoder(layer(), num_layers=DEPTH//2)
127
+ self.pooling = POOLING
128
+
129
+ def forward(self, z):
130
+ x = self.proj(z) # [B, L', D]
131
+ if self.pooling:
132
+ x = x.repeat_interleave(2, dim=1) # unpool to full length
133
+ return self.decoder(x) # [B, L, out_dim]
134
+
135
+ # ---------------- Training Loop ----------------
136
+ def train_with_precomputed_embeddings(embeddings_path, device='cuda'):
137
+ """
138
+ Train compressor using pre-computed embeddings from final_sequence_encoder.py
139
+ """
140
+ # Load dataset
141
+ ds = PrecomputedEmbeddingDataset(embeddings_path)
142
+
143
+ # Compute normalization statistics
144
+ print("Computing normalization statistics...")
145
+ flat = ds.embeddings.view(-1, ESM_DIM)
146
+ stats = {
147
+ 'mean': flat.mean(0),
148
+ 'std': flat.std(0) + 1e-8,
149
+ 'min': torch.clamp((flat - flat.mean(0)) / (flat.std(0) + 1e-8), -4,4).min(0)[0],
150
+ 'max': torch.clamp((flat - flat.mean(0)) / (flat.std(0) + 1e-8), -4,4).max(0)[0]
151
+ }
152
+
153
+ # Save statistics for later use
154
+ torch.save(stats, 'normalization_stats.pt')
155
+ print("Saved normalization statistics to normalization_stats.pt")
156
+
157
+ # Create data loader
158
+ dl = DataLoader(ds, batch_size=BATCH_SIZE, shuffle=True)
159
+
160
+ # Initialize models
161
+ comp = Compressor().to(device)
162
+ decomp = Decompressor().to(device)
163
+
164
+ # Initialize optimizer
165
+ opt = optim.AdamW(list(comp.parameters()) + list(decomp.parameters()), lr=BASE_LR)
166
+
167
+ # LR scheduling: warmup -> cosine
168
+ warmup_sched = LinearLR(opt, start_factor=1e-8, end_factor=1.0, total_iters=WARMUP_STEPS)
169
+ cosine_sched = CosineAnnealingLR(opt, T_max=EPOCHS*len(dl), eta_min=LR_MIN)
170
+ sched = SequentialLR(opt, [warmup_sched, cosine_sched], milestones=[WARMUP_STEPS])
171
+
172
+ print(f"Starting training for {EPOCHS} epochs...")
173
+ print(f"Device: {device}")
174
+ print(f"Batch size: {BATCH_SIZE}")
175
+ print(f"Total batches per epoch: {len(dl)}")
176
+
177
+ # Training loop
178
+ for epoch in range(1, EPOCHS+1):
179
+ total_loss = 0
180
+ comp.train()
181
+ decomp.train()
182
+
183
+ for batch_idx, x in enumerate(tqdm(dl, desc=f"Epoch {epoch}/{EPOCHS}")):
184
+ x = x.to(device)
185
+ z = comp(x, stats)
186
+ xr = decomp(z)
187
+ loss = (x - xr).pow(2).mean()
188
+
189
+ opt.zero_grad()
190
+ loss.backward()
191
+ opt.step()
192
+ sched.step()
193
+
194
+ total_loss += loss.item()
195
+
196
+ # Print progress every 100 batches
197
+ if batch_idx % 100 == 0:
198
+ print(f" Batch {batch_idx}/{len(dl)} - Loss: {loss.item():.6f}")
199
+
200
+ avg_loss = total_loss / len(dl)
201
+ print(f"Epoch {epoch}/{EPOCHS} — Average MSE: {avg_loss:.6f}")
202
+
203
+ # Save checkpoint every 5 epochs
204
+ if epoch % 5 == 0:
205
+ torch.save({
206
+ 'epoch': epoch,
207
+ 'compressor_state_dict': comp.state_dict(),
208
+ 'decompressor_state_dict': decomp.state_dict(),
209
+ 'optimizer_state_dict': opt.state_dict(),
210
+ 'loss': avg_loss,
211
+ }, f'checkpoint_epoch_{epoch}.pth')
212
+
213
+ # Save final models
214
+ torch.save(comp.state_dict(), 'compressor_final.pth')
215
+ torch.save(decomp.state_dict(), 'decompressor_final.pth')
216
+ print("Training completed! Models saved as compressor_final.pth and decompressor_final.pth")
217
+
218
+ # ---------------- Utility Functions ----------------
219
+ def load_and_test_models(compressor_path, decompressor_path, embeddings_path, device='cuda'):
220
+ """
221
+ Load trained models and test reconstruction quality
222
+ """
223
+ print("Loading trained models...")
224
+ comp = Compressor().to(device)
225
+ decomp = Decompressor().to(device)
226
+
227
+ comp.load_state_dict(torch.load(compressor_path))
228
+ decomp.load_state_dict(torch.load(decompressor_path))
229
+
230
+ comp.eval()
231
+ decomp.eval()
232
+
233
+ # Load test data
234
+ ds = PrecomputedEmbeddingDataset(embeddings_path)
235
+ test_loader = DataLoader(ds, batch_size=16, shuffle=False)
236
+
237
+ # Load normalization stats
238
+ stats = torch.load('normalization_stats.pt')
239
+
240
+ print("Testing reconstruction quality...")
241
+ total_mse = 0
242
+ total_samples = 0
243
+
244
+ with torch.no_grad():
245
+ for batch in tqdm(test_loader, desc="Testing"):
246
+ x = batch.to(device)
247
+ z = comp(x, stats)
248
+ xr = decomp(z)
249
+ mse = (x - xr).pow(2).mean()
250
+ total_mse += mse.item() * len(x)
251
+ total_samples += len(x)
252
+
253
+ avg_mse = total_mse / total_samples
254
+ print(f"Average reconstruction MSE: {avg_mse:.6f}")
255
+
256
+ return avg_mse
257
+
258
+ # ---------------- Entrypoint ----------------
259
+ if __name__ == '__main__':
260
+ import argparse
261
+
262
+ parser = argparse.ArgumentParser(description='Train protein compressor with pre-computed embeddings')
263
+ parser.add_argument('--embeddings', type=str, default='/data2/edwardsun/flow_project/compressor_dataset/peptide_embeddings.pt',
264
+ help='Directory of pre-computed .pt embeddings from final_sequence_encoder.py (PrecomputedEmbeddingDataset globs *.pt inside it)')
265
+ parser.add_argument('--device', type=str, default='cuda', help='Device to use (cuda/cpu)')
266
+ parser.add_argument('--test', action='store_true', help='Test existing models instead of training')
267
+
268
+ args = parser.parse_args()
269
+
270
+ device = torch.device(args.device if torch.cuda.is_available() else 'cpu')
271
+ print(f"Using device: {device}")
272
+
273
+ if args.test:
274
+ # Test existing models
275
+ load_and_test_models('compressor_final.pth', 'decompressor_final.pth', args.embeddings, device)
276
+ else:
277
+ # Train new models
278
+ train_with_precomputed_embeddings(args.embeddings, device)
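The hourglass compressor halves the sequence length by mean-pooling adjacent positions and projects each token from 1280 down to 1280/16 = 80 dimensions; the decompressor reverses both steps. A quick shape check, assuming the `normalization_stats.pt` file written by the training loop above:

```python
import torch
from compressor_with_embeddings import Compressor, Decompressor, ESM_DIM, MAX_SEQ_LEN

comp, decomp = Compressor().eval(), Decompressor().eval()
stats = torch.load('normalization_stats.pt', map_location='cpu')  # saved during training

x = torch.randn(2, MAX_SEQ_LEN, ESM_DIM)  # (B, 50, 1280) ESM-2 embeddings
with torch.no_grad():
    z = comp(x, stats)      # (2, 25, 80): length halved, 16x channel compression
    x_rec = decomp(z)       # (2, 50, 1280): back to the original shape
print(z.shape, x_rec.shape)
```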
final_flow_model.py ADDED
@@ -0,0 +1,310 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import math
5
+
6
+ class SinusoidalTimeEmbedding(nn.Module):
7
+ """Sinusoidal time embedding as used in ProtFlow paper."""
8
+
9
+ def __init__(self, dim):
10
+ super().__init__()
11
+ self.dim = dim
12
+
13
+ def forward(self, time):
14
+ device = time.device
15
+ half_dim = self.dim // 2
16
+ embeddings = math.log(10000) / (half_dim - 1)
17
+ embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
18
+ # Ensure time is 2D: [B, 1] and embeddings is 1D: [half_dim]
19
+ if time.dim() > 2:
20
+ time = time.squeeze() # Remove extra dimensions
21
+ embeddings = time.unsqueeze(-1) * embeddings.unsqueeze(0) # [B, half_dim]
22
+ embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1) # [B, dim]
23
+ # Ensure output is exactly 2D
24
+ if embeddings.dim() > 2:
25
+ embeddings = embeddings.squeeze()
26
+ return embeddings
27
+
28
+ class LabelMLP(nn.Module):
29
+ """
30
+ MLP for processing class labels into embeddings.
31
+ This approach processes labels separately from time embeddings.
32
+ """
33
+ def __init__(self, num_classes=3, hidden_dim=480, mlp_dim=256):
34
+ super().__init__()
35
+ self.num_classes = num_classes
36
+
37
+ # MLP to process labels
38
+ self.label_mlp = nn.Sequential(
39
+ nn.Embedding(num_classes, mlp_dim),
40
+ nn.Linear(mlp_dim, mlp_dim),
41
+ nn.GELU(),
42
+ nn.Linear(mlp_dim, hidden_dim),
43
+ nn.GELU(),
44
+ nn.Linear(hidden_dim, hidden_dim)
45
+ )
46
+
47
+ # Initialize embeddings
48
+ nn.init.normal_(self.label_mlp[0].weight, std=0.02)
49
+
50
+ def forward(self, labels):
51
+ """
52
+ Args:
53
+ labels: (B,) tensor of class labels
54
+ - 0: AMP (MIC < 100)
55
+ - 1: Non-AMP (MIC >= 100)
56
+ - 2: Mask (Unknown MIC)
57
+ Returns:
58
+ embeddings: (B, hidden_dim) tensor of processed label embeddings
59
+ """
60
+ return self.label_mlp(labels)
61
+
62
+ class AMPFlowMatcherCFGConcat(nn.Module):
63
+ """
64
+ Flow Matching model with Classifier-Free Guidance using concatenation approach.
65
+ - 12-layer transformer with long skip connections
66
+ - Time embedding + MLP-processed label embedding (concatenated then projected)
67
+ - Optimized for peptide sequences (max length 50)
68
+ """
69
+
70
+ def __init__(self, hidden_dim=480, compressed_dim=30, n_layers=12, n_heads=16,
71
+ dim_ff=3072, dropout=0.1, max_seq_len=25, use_cfg=True):
72
+ super().__init__()
73
+ self.hidden_dim = hidden_dim
74
+ self.compressed_dim = compressed_dim
75
+ self.n_layers = n_layers
76
+ self.max_seq_len = max_seq_len
77
+ self.use_cfg = use_cfg
78
+
79
+ # Time embedding
80
+ self.time_embed = nn.Sequential(
81
+ SinusoidalTimeEmbedding(hidden_dim),
82
+ nn.Linear(hidden_dim, hidden_dim),
83
+ nn.GELU(),
84
+ nn.Linear(hidden_dim, hidden_dim)
85
+ )
86
+
87
+ # CFG components using concatenation approach
88
+ if use_cfg:
89
+ self.label_mlp = LabelMLP(num_classes=3, hidden_dim=hidden_dim)
90
+
91
+ # Projection layer for concatenated time + label embeddings
92
+ self.condition_proj = nn.Sequential(
93
+ nn.Linear(hidden_dim * 2, hidden_dim), # 2 for time + label
94
+ nn.GELU(),
95
+ nn.Linear(hidden_dim, hidden_dim)
96
+ )
97
+
98
+ # Projection layers for compressed space
99
+ self.compress_proj = nn.Linear(compressed_dim, hidden_dim)
100
+ self.decompress_proj = nn.Linear(hidden_dim, compressed_dim)
101
+
102
+ # Positional encoding for peptide sequences
103
+ self.pos_embed = nn.Parameter(torch.randn(1, max_seq_len, hidden_dim))
104
+
105
+ # Transformer layers with long skip connections
106
+ self.layers = nn.ModuleList([
107
+ nn.TransformerEncoderLayer(
108
+ d_model=hidden_dim,
109
+ nhead=n_heads,
110
+ dim_feedforward=dim_ff,
111
+ dropout=dropout,
112
+ activation='gelu',
113
+ batch_first=True
114
+ ) for _ in range(n_layers)
115
+ ])
116
+
117
+ # Long skip connections (U-ViT style)
118
+ self.skip_projections = nn.ModuleList([
119
+ nn.Linear(hidden_dim, hidden_dim) for _ in range(n_layers - 1)
120
+ ])
121
+
122
+ # Output projection
123
+ self.output_proj = nn.Linear(hidden_dim, compressed_dim)
124
+
125
+ def forward(self, x, t, labels=None, mask=None):
126
+ """
127
+ Args:
128
+ x: compressed latent (B, L, compressed_dim) - AMP embeddings
129
+ t: time scalar (B,) or (B, 1)
130
+ labels: class labels (B,) for CFG - 0=AMP, 1=Non-AMP, 2=Mask
131
+ mask: attention mask (B, L) if needed
132
+ """
133
+ B, L, D = x.shape
134
+
135
+ # Project to hidden dimension
136
+ x = self.compress_proj(x) # (B, L, hidden_dim)
137
+
138
+ # Add positional encoding
139
+ if L <= self.max_seq_len:
140
+ x = x + self.pos_embed[:, :L, :]
141
+
142
+ # Time embedding - ensure t is 2D (B, 1)
143
+ if t.dim() == 1:
144
+ t = t.unsqueeze(-1) # (B, 1)
145
+ elif t.dim() > 2:
146
+ t = t.squeeze() # Remove extra dimensions
147
+ if t.dim() == 1:
148
+ t = t.unsqueeze(-1) # (B, 1)
149
+
150
+ t_emb = self.time_embed(t) # (B, hidden_dim)
151
+ # Ensure t_emb is 2D before expanding
152
+ if t_emb.dim() > 2:
153
+ t_emb = t_emb.squeeze() # Remove extra dimensions
154
+ t_emb = t_emb.unsqueeze(1).expand(-1, L, -1) # (B, L, hidden_dim)
155
+
156
+ # CFG: Process label embedding if enabled
157
+ if self.use_cfg and labels is not None:
158
+ # Process labels through MLP
159
+ label_emb = self.label_mlp(labels) # (B, hidden_dim)
160
+ label_emb = label_emb.unsqueeze(1).expand(-1, L, -1) # (B, L, hidden_dim)
161
+
162
+ # Professor's approach: Concatenate time and label embeddings
163
+ combined_emb = torch.cat([t_emb, label_emb], dim=-1) # (B, L, hidden_dim*2)
164
+ projected_emb = self.condition_proj(combined_emb) # (B, L, hidden_dim)
165
+ else:
166
+ projected_emb = t_emb # Just use time embedding if no CFG
167
+
168
+ # Store intermediate representations for skip connections
169
+ skip_features = []
170
+
171
+ # Pass through transformer layers with skip connections
172
+ for i, layer in enumerate(self.layers):
173
+ # Add skip connection from earlier layers
174
+ if i > 0 and i < len(self.layers) - 1:
175
+ skip_feat = skip_features[i-1]
176
+ skip_feat = self.skip_projections[i-1](skip_feat)
177
+ x = x + skip_feat
178
+
179
+ # Store current features for future skip connections
180
+ if i < len(self.layers) - 1:
181
+ skip_features.append(x.clone())
182
+
183
+ # Add projected condition embedding to EACH layer
184
+ x = x + projected_emb
185
+
186
+ # Apply transformer layer
187
+ x = layer(x, src_key_padding_mask=mask)
188
+
189
+ # Project back to compressed dimension
190
+ x = self.output_proj(x) # (B, L, compressed_dim)
191
+
192
+ return x
193
+
194
+ class AMPProtFlowPipelineCFG:
195
+ """
196
+ Complete ProtFlow pipeline for AMP generation with CFG.
197
+ """
198
+
199
+ def __init__(self, compressor, decompressor, flow_model, device='cuda'):
200
+ self.compressor = compressor
201
+ self.decompressor = decompressor
202
+ self.flow_model = flow_model
203
+ self.device = device
204
+
205
+ # Load normalization stats
206
+ self.stats = torch.load('normalization_stats.pt', map_location=device)
207
+
208
+ def generate_amps_cfg(self, num_samples=100, num_steps=25, cfg_scale=7.5,
209
+ condition_label=0):
210
+ """
211
+ Generate AMP samples using CFG.
212
+
213
+ Args:
214
+ num_samples: Number of samples to generate
215
+ num_steps: Number of ODE solving steps
216
+ cfg_scale: CFG guidance scale (higher = stronger conditioning)
217
+ condition_label: 0=AMP, 1=Non-AMP, 2=Mask
218
+ """
219
+ print(f"Generating {num_samples} samples with CFG (label={condition_label}, scale={cfg_scale})...")
220
+
221
+ # Sample random noise
222
+ batch_size = min(num_samples, 32) # Process in batches
223
+ all_samples = []
224
+
225
+ for i in range(0, num_samples, batch_size):
226
+ current_batch = min(batch_size, num_samples - i)
227
+
228
+ # Initialize with noise
229
+ eps = torch.randn(current_batch, self.flow_model.max_seq_len,
230
+ self.flow_model.compressed_dim, device=self.device)
231
+
232
+ # ODE solving steps with CFG
233
+ xt = eps.clone()
234
+ for step in range(num_steps):
235
+ t = torch.ones(current_batch, device=self.device) * (1.0 - step/num_steps)
236
+
237
+ # CFG: Generate with condition and without condition
238
+ if cfg_scale > 0:
239
+ # With condition
240
+ vt_cond = self.flow_model(xt, t,
241
+ labels=torch.full((current_batch,), condition_label,
242
+ device=self.device))
243
+
244
+ # Without condition (mask)
245
+ vt_uncond = self.flow_model(xt, t,
246
+ labels=torch.full((current_batch,), 2,
247
+ device=self.device))
248
+
249
+ # CFG interpolation
250
+ vt = vt_uncond + cfg_scale * (vt_cond - vt_uncond)
251
+ else:
252
+ # No CFG, use mask label
253
+ vt = self.flow_model(xt, t,
254
+ labels=torch.full((current_batch,), 2,
255
+ device=self.device))
256
+
257
+ # Euler step for backward integration (t: 1 -> 0)
258
+ # Use negative dt to integrate backward from noise to data
259
+ dt = -1.0 / num_steps
260
+ xt = xt + vt * dt
261
+
262
+ all_samples.append(xt)
263
+
264
+ # Concatenate all batches
265
+ generated = torch.cat(all_samples, dim=0)
266
+
267
+ # Decompress and decode
268
+ with torch.no_grad():
269
+ # Decompress
270
+ decompressed = self.decompressor(generated)
271
+
272
+ # Apply reverse normalization
273
+ m, s, mn, mx = self.stats['mean'], self.stats['std'], self.stats['min'], self.stats['max']
274
+ decompressed = decompressed * (mx - mn + 1e-8) + mn
275
+ decompressed = decompressed * s + m
276
+
277
+ return generated, decompressed
278
+
279
+ # Example usage
280
+ if __name__ == "__main__":
281
+ # Initialize FINAL AMP flow model with CFG using concatenation approach
282
+ flow_model = AMPFlowMatcherCFGConcat(
283
+ hidden_dim=480,
284
+ compressed_dim=30, # 16x compression of 480
285
+ n_layers=12,
286
+ n_heads=16,
287
+ dim_ff=3072,
288
+ max_seq_len=25, # For AMP sequences (max 50, halved by pooling)
289
+ use_cfg=True
290
+ )
291
+
292
+ print(f"FINAL AMP Flow Model with CFG (Concat+Proj) parameters: {sum(p.numel() for p in flow_model.parameters()):,}")
293
+
294
+ # Test forward pass
295
+ batch_size = 4
296
+ seq_len = 20
297
+ compressed_dim = 30
298
+
299
+ x = torch.randn(batch_size, seq_len, compressed_dim)
300
+ t = torch.rand(batch_size)
301
+ labels = torch.randint(0, 3, (batch_size,)) # Random labels
302
+
303
+ with torch.no_grad():
304
+ output = flow_model(x, t, labels=labels)
305
+ print(f"Input shape: {x.shape}")
306
+ print(f"Output shape: {output.shape}")
307
+ print(f"Time embedding shape: {t.shape}")
308
+ print(f"Labels: {labels}")
309
+
310
+ print("🎯 FINAL AMP Flow Model with CFG (Concat+Proj) ready for training!")
final_sequence_decoder.py ADDED
@@ -0,0 +1,338 @@
1
+ import torch
2
+ import torch.nn.functional as F
3
+ import numpy as np
4
+ import esm
5
+ from tqdm import tqdm
6
+ import os
7
+ from datetime import datetime
8
+
9
+ class EmbeddingToSequenceConverter:
10
+ """
11
+ Convert ESM embeddings back to amino acid sequences using real ESM2 token embeddings.
12
+ """
13
+
14
+ def __init__(self, device='cuda'):
15
+ self.device = device
16
+
17
+ # Load ESM model
18
+ print("Loading ESM model for sequence decoding...")
19
+ self.model, self.alphabet = esm.pretrained.esm2_t33_650M_UR50D()
20
+ self.model = self.model.to(device)
21
+ self.model.eval()
22
+
23
+ # Get vocabulary
24
+ self.vocab = self.alphabet.standard_toks
25
+ self.vocab_list = [token for token in self.vocab if token not in ['<cls>', '<eos>', '<unk>', '<pad>', '<mask>']]
26
+
27
+ # Pre-compute token embeddings for nearest neighbor search
28
+ self._precompute_token_embeddings()
29
+
30
+ print("✓ ESM model loaded for sequence decoding")
31
+
32
+ def _precompute_token_embeddings(self):
33
+ """
34
+ Pre-compute embeddings for all tokens in the vocabulary using real ESM2 embeddings.
35
+ """
36
+ print("Pre-computing token embeddings from ESM2 model...")
37
+
38
+ # Use standard amino acids
39
+ standard_aas = 'ACDEFGHIKLMNPQRSTVWY'
40
+ self.token_list = list(standard_aas)
41
+
42
+ # Extract real embeddings from ESM2 model
43
+ with torch.no_grad():
44
+ # Get token indices for each amino acid
45
+ aa_tokens = []
46
+ for aa in standard_aas:
47
+ try:
48
+ token_idx = self.alphabet.get_idx(aa)
49
+ aa_tokens.append(token_idx)
50
+ except:
51
+ print(f"Warning: Could not find token for amino acid {aa}")
52
+ # Fallback to a default token
53
+ aa_tokens.append(0)
54
+
55
+ # Convert to tensor
56
+ aa_tokens = torch.tensor(aa_tokens, device=self.device)
57
+
58
+ # Extract embeddings from ESM2's embedding layer
59
+ # Note: ESM2 uses a different embedding structure, so we'll use the model's forward pass
60
+ # Create dummy sequences for each amino acid
61
+ dummy_sequences = [(f"aa_{i}", aa) for i, aa in enumerate(standard_aas)]
62
+
63
+ # Get embeddings using the same method as the encoder
64
+ converter = self.alphabet.get_batch_converter()
65
+ _, _, tokens = converter(dummy_sequences)
66
+ tokens = tokens.to(self.device)
67
+
68
+ # Get embeddings from layer 33 (same as encoder)
69
+ with torch.no_grad():
70
+ out = self.model(tokens, repr_layers=[33], return_contacts=False)
71
+ reps = out['representations'][33] # [B, L+2, D]
72
+
73
+ # Extract per-residue embeddings (remove CLS and EOS tokens)
74
+ token_embeddings = []
75
+ for i, (_, seq) in enumerate(dummy_sequences):
76
+ L = len(seq)
77
+ emb = reps[i, 1:1+L, :] # Remove CLS and EOS tokens
78
+ # Take the first position embedding for each amino acid
79
+ token_embeddings.append(emb[0])
80
+
81
+ self.token_embeddings = torch.stack(token_embeddings)
82
+
83
+ print(f"✓ Pre-computed embeddings for {len(self.token_embeddings)} tokens")
84
+ print(f" Embedding shape: {self.token_embeddings.shape}")
85
+
86
+ def embedding_to_sequence(self, embedding, method='diverse', temperature=0.5):
87
+ """
88
+ Convert a single embedding back to amino acid sequence.
89
+
90
+ Args:
91
+ embedding: [seq_len, embed_dim] tensor
92
+ method: 'diverse', 'nearest_neighbor', or 'random'
93
+ temperature: Softmax temperature for sampling (higher = more random/diverse; lower approaches nearest-neighbor decoding)
94
+
95
+ Returns:
96
+ sequence: string of amino acids
97
+ """
98
+ if method == 'diverse':
99
+ return self._diverse_decode(embedding, temperature)
100
+ elif method == 'nearest_neighbor':
101
+ return self._nearest_neighbor_decode(embedding)
102
+ elif method == 'random':
103
+ return self._random_decode(embedding)
104
+ else:
105
+ raise ValueError(f"Unknown method: {method}")
106
+
107
+ def _diverse_decode(self, embedding, temperature=0.5):
108
+ """
109
+ Decode using diverse sampling with temperature control.
110
+ """
111
+ # Ensure both tensors are on the same device
112
+ embedding = embedding.to(self.device)
113
+ token_embeddings = self.token_embeddings.to(self.device)
114
+
115
+ # Compute cosine similarity between embedding and all token embeddings
116
+ embedding_norm = F.normalize(embedding, dim=-1) # [seq_len, embed_dim]
117
+ token_embeddings_norm = F.normalize(token_embeddings, dim=-1) # [vocab_size, embed_dim]
118
+
119
+ # Compute similarities
120
+ similarities = torch.mm(embedding_norm, token_embeddings_norm.t()) # [seq_len, vocab_size]
121
+
122
+ # Apply temperature to increase diversity
123
+ similarities = similarities / temperature
124
+
125
+ # Convert to probabilities
126
+ probs = F.softmax(similarities, dim=-1)
127
+
128
+ # Sample from the distribution
129
+ sampled_indices = torch.multinomial(probs, 1).squeeze(-1)
130
+
131
+ # Convert to sequence
132
+ sequence = ''.join([self.token_list[idx] for idx in sampled_indices.cpu().numpy()])
133
+
134
+ return sequence
135
+
136
+ def _nearest_neighbor_decode(self, embedding):
137
+ """
138
+ Decode using nearest neighbor search in token embedding space.
139
+ """
140
+ # Ensure both tensors are on the same device
141
+ embedding = embedding.to(self.device)
142
+ token_embeddings = self.token_embeddings.to(self.device)
143
+
144
+ # Compute cosine similarity between embedding and all token embeddings
145
+ embedding_norm = F.normalize(embedding, dim=-1) # [seq_len, embed_dim]
146
+ token_embeddings_norm = F.normalize(token_embeddings, dim=-1) # [vocab_size, embed_dim]
147
+
148
+ # Compute similarities
149
+ similarities = torch.mm(embedding_norm, token_embeddings_norm.t()) # [seq_len, vocab_size]
150
+
151
+ # Find nearest neighbors
152
+ nearest_indices = torch.argmax(similarities, dim=-1) # [seq_len]
153
+
154
+ # Convert to sequence
155
+ sequence = ''.join([self.token_list[idx] for idx in nearest_indices.cpu().numpy()])
156
+
157
+ return sequence
158
+
159
+ def _random_decode(self, embedding):
160
+ """
161
+ Decode using random sampling (fallback method).
162
+ """
163
+ seq_len = embedding.shape[0]
164
+ sequence = ''.join(np.random.choice(self.token_list, seq_len))
165
+ return sequence
166
+
167
+ def batch_embedding_to_sequences(self, embeddings, method='diverse', temperature=0.5):
168
+ """
169
+ Convert batch of embeddings to sequences.
170
+
171
+ Args:
172
+ embeddings: [batch_size, seq_len, embed_dim] tensor
173
+ method: decoding method
174
+ temperature: Temperature for diverse sampling
175
+
176
+ Returns:
177
+ sequences: list of strings
178
+ """
179
+ sequences = []
180
+
181
+ for i in tqdm(range(len(embeddings)), desc="Converting embeddings to sequences"):
182
+ embedding = embeddings[i]
183
+ sequence = self.embedding_to_sequence(embedding, method=method, temperature=temperature)
184
+ sequences.append(sequence)
185
+
186
+ return sequences
187
+
188
+ def validate_sequence(self, sequence):
189
+ """
190
+ Validate if a sequence contains valid amino acids.
191
+ """
192
+ valid_aas = set('ACDEFGHIKLMNPQRSTVWY')
193
+ return all(aa in valid_aas for aa in sequence)
194
+
195
+ def filter_valid_sequences(self, sequences):
196
+ """
197
+ Filter out sequences with invalid amino acids.
198
+ """
199
+ valid_sequences = []
200
+ for seq in sequences:
201
+ if self.validate_sequence(seq):
202
+ valid_sequences.append(seq)
203
+ else:
204
+ print(f"Warning: Invalid sequence found: {seq}")
205
+
206
+ return valid_sequences
207
+
208
+ def main():
209
+ """
210
+ Decode all CFG-generated peptide embeddings to sequences and analyze distribution.
211
+ Uses the best trained model (loss: 0.017183, step: 53).
212
+ """
213
+ print("=== CFG-Generated Peptide Sequence Decoder (Best Model) ===")
214
+
215
+ # Initialize converter
216
+ converter = EmbeddingToSequenceConverter()
217
+
218
+ # Get today's date for filename
219
+ today = datetime.now().strftime('%Y%m%d')
220
+
221
+ # Load all CFG-generated embeddings (using best model)
222
+ cfg_files = {
223
+ 'No CFG (0.0)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_no_cfg_{today}.pt',
224
+ 'Weak CFG (3.0)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_weak_cfg_{today}.pt',
225
+ 'Strong CFG (7.5)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_strong_cfg_{today}.pt',
226
+ 'Very Strong CFG (15.0)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_very_strong_cfg_{today}.pt'
227
+ }
228
+
229
+ all_results = {}
230
+
231
+ for cfg_name, file_path in cfg_files.items():
232
+ print(f"\n{'='*50}")
233
+ print(f"Processing {cfg_name}...")
234
+ print(f"Loading: {file_path}")
235
+
236
+ try:
237
+ # Load embeddings
238
+ embeddings = torch.load(file_path, map_location='cpu')
239
+ print(f"✓ Loaded {len(embeddings)} embeddings, shape: {embeddings.shape}")
240
+
241
+ # Decode to sequences using diverse method
242
+ print(f"Decoding sequences...")
243
+ sequences = converter.batch_embedding_to_sequences(embeddings, method='diverse', temperature=0.5)
244
+
245
+ # Filter valid sequences
246
+ valid_sequences = converter.filter_valid_sequences(sequences)
247
+ print(f"✓ Valid sequences: {len(valid_sequences)}/{len(sequences)}")
248
+
249
+ # Store results
250
+ all_results[cfg_name] = {
251
+ 'sequences': valid_sequences,
252
+ 'total': len(sequences),
253
+ 'valid': len(valid_sequences)
254
+ }
255
+
256
+ # Show sample sequences
257
+ print(f"\nSample sequences ({cfg_name}):")
258
+ for i, seq in enumerate(valid_sequences[:5]):
259
+ print(f" {i+1}: {seq}")
260
+
261
+ except Exception as e:
262
+ print(f"❌ Error processing {file_path}: {e}")
263
+ all_results[cfg_name] = {'sequences': [], 'total': 0, 'valid': 0}
264
+
265
+ # Analysis and comparison
266
+ print(f"\n{'='*60}")
267
+ print("CFG ANALYSIS SUMMARY")
268
+ print(f"{'='*60}")
269
+
270
+ for cfg_name, results in all_results.items():
271
+ sequences = results['sequences']
272
+ if sequences:
273
+ # Calculate sequence statistics
274
+ lengths = [len(seq) for seq in sequences]
275
+ avg_length = np.mean(lengths)
276
+ std_length = np.std(lengths)
277
+
278
+ # Calculate amino acid composition
279
+ all_aas = ''.join(sequences)
280
+ aa_counts = {}
281
+ for aa in 'ACDEFGHIKLMNPQRSTVWY':
282
+ aa_counts[aa] = all_aas.count(aa)
283
+
284
+ # Calculate diversity (unique sequences)
285
+ unique_sequences = len(set(sequences))
286
+ diversity_ratio = unique_sequences / len(sequences)
287
+
288
+ print(f"\n{cfg_name}:")
289
+ print(f" Total sequences: {results['total']}")
290
+ print(f" Valid sequences: {results['valid']}")
291
+ print(f" Unique sequences: {unique_sequences}")
292
+ print(f" Diversity ratio: {diversity_ratio:.3f}")
293
+ print(f" Avg length: {avg_length:.1f} ± {std_length:.1f}")
294
+ print(f" Length range: {min(lengths)}-{max(lengths)}")
295
+
296
+ # Show top amino acids
297
+ sorted_aas = sorted(aa_counts.items(), key=lambda x: x[1], reverse=True)
298
+ print(f" Top 5 AAs: {', '.join([f'{aa}({count})' for aa, count in sorted_aas[:5]])}")
299
+
300
+ # Create output directory if it doesn't exist
301
+ output_dir = '/data2/edwardsun/decoded_sequences'
302
+ os.makedirs(output_dir, exist_ok=True)
303
+
304
+ # Save sequences to file with date
305
+ output_file = os.path.join(output_dir, f"decoded_sequences_{cfg_name.lower().replace(' ', '_').replace('(', '').replace(')', '').replace('.', '')}_{today}.txt")
306
+ with open(output_file, 'w') as f:
307
+ f.write(f"# Decoded sequences from {cfg_name}\n")
308
+ f.write(f"# Total: {results['total']}, Valid: {results['valid']}, Unique: {unique_sequences}\n")
309
+ f.write(f"# Generated from best model (loss: 0.017183, step: 53)\n\n")
310
+ for i, seq in enumerate(sequences):
311
+ f.write(f"seq_{i+1:03d}\t{seq}\n")
312
+ print(f" ✓ Saved to: {output_file}")
313
+
314
+ # Overall comparison
315
+ print(f"\n{'='*60}")
316
+ print("OVERALL COMPARISON")
317
+ print(f"{'='*60}")
318
+
319
+ cfg_names = list(all_results.keys())
320
+ valid_counts = [all_results[name]['valid'] for name in cfg_names]
321
+ unique_counts = [len(set(all_results[name]['sequences'])) for name in cfg_names]
322
+
323
+ print(f"Valid sequences: {dict(zip(cfg_names, valid_counts))}")
324
+ print(f"Unique sequences: {dict(zip(cfg_names, unique_counts))}")
325
+
326
+ # Find most diverse and most similar
327
+ if all(valid_counts):
328
+ diversity_ratios = [unique_counts[i]/valid_counts[i] for i in range(len(valid_counts))]
329
+ most_diverse = cfg_names[diversity_ratios.index(max(diversity_ratios))]
330
+ least_diverse = cfg_names[diversity_ratios.index(min(diversity_ratios))]
331
+
332
+ print(f"\nMost diverse: {most_diverse} (ratio: {max(diversity_ratios):.3f})")
333
+ print(f"Least diverse: {least_diverse} (ratio: {min(diversity_ratios):.3f})")
334
+
335
+ print(f"\n✓ Decoding complete! Check the output files for detailed sequences.")
336
+
337
+ if __name__ == "__main__":
338
+ main()
final_sequence_encoder.py ADDED
@@ -0,0 +1,215 @@
1
+ import json
2
+ import os
3
+ import torch
4
+ import torch.nn.functional as F
5
+ import esm
6
+ from tqdm import tqdm
7
+ import numpy as np
8
+
9
+ # ---------------- Configuration ----------------
10
+ DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
11
+ BATCH_SIZE = 32 # increased for GPU efficiency
12
+ MAX_SEQ_LEN = 50 # max sequence length for AMPs
13
+ MIN_SEQ_LEN = 2 # minimum length for filtering
14
+ CANONICAL_AA = set('ACDEFGHIKLMNPQRSTVWY')
15
+
16
+ print(f"Using device: {DEVICE}")
17
+ if torch.cuda.is_available():
18
+ print(f"GPU: {torch.cuda.get_device_name()}")
19
+ print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
20
+
21
+ # ---------------- Sequence Loading ----------------
22
+ def read_peptides_json(json_file):
23
+ """
24
+ Read and filter sequences from the all_peptides_data.json file.
25
+ Extracts sequences from both main peptides and their monomers.
26
+ Filters:
27
+ - Only canonical 20 AAs
28
+ - Sequence length between MIN_SEQ_LEN and MAX_SEQ_LEN
29
+ - Non-empty sequences
30
+ Returns:
31
+ List of (seq_id, sequence) tuples.
32
+ """
33
+ print(f"Loading peptides from {json_file}...")
34
+ with open(json_file, 'r') as f:
35
+ data = json.load(f)
36
+
37
+ seqs = []
38
+ processed_ids = set()
39
+
40
+ for item in tqdm(data, desc="Processing peptides"):
41
+ # Process main peptide sequence
42
+ if 'sequence' in item and item['sequence']:
43
+ seq = item['sequence'].upper().strip()
44
+ if (MIN_SEQ_LEN <= len(seq) <= MAX_SEQ_LEN and
45
+ all(aa in CANONICAL_AA for aa in seq)):
46
+ seq_id = f"main_{item.get('id', 'unk')}"
47
+ if seq_id not in processed_ids:
48
+ seqs.append((seq_id, seq))
49
+ processed_ids.add(seq_id)
50
+
51
+ # Process monomer sequences
52
+ if 'monomers' in item and item['monomers']:
53
+ for monomer in item['monomers']:
54
+ if 'sequence' in monomer and monomer['sequence']:
55
+ seq = monomer['sequence'].upper().strip()
56
+ if (MIN_SEQ_LEN <= len(seq) <= MAX_SEQ_LEN and
57
+ all(aa in CANONICAL_AA for aa in seq)):
58
+ seq_id = f"monomer_{monomer.get('id', 'unk')}"
59
+ if seq_id not in processed_ids:
60
+ seqs.append((seq_id, seq))
61
+ processed_ids.add(seq_id)
62
+
63
+ print(f"Found {len(seqs)} valid sequences")
64
+ return seqs
65
+
66
+ @torch.no_grad()
67
+ def get_per_residue_embeddings(model, alphabet, sequences, batch_size=BATCH_SIZE):
68
+ """
69
+ Compute per-residue ESM-2 embeddings for a list of (id, seq).
70
+ Pads or truncates each embedding to shape [MAX_SEQ_LEN, D].
71
+ Returns a dict {seq_id: tensor[MAX_SEQ_LEN, D]} on CPU.
72
+ """
73
+ model.eval()
74
+ converter = alphabet.get_batch_converter()
75
+ embeddings = {}
76
+
77
+ print(f"Computing embeddings for {len(sequences)} sequences...")
78
+ for i in tqdm(range(0, len(sequences), batch_size), desc="Computing embeddings"):
79
+ batch = sequences[i:i+batch_size]
80
+ labels, seqs = zip(*batch)
81
+ _, _, tokens = converter(batch)
82
+ tokens = tokens.to(DEVICE)
83
+
84
+ out = model(tokens, repr_layers=[33], return_contacts=False)
85
+ reps = out['representations'][33] # [B, L+2, D]
86
+
87
+ for idx, sid in enumerate(labels):
88
+ seq = seqs[idx]
89
+ L = len(seq)
90
+ # take per-residue embeddings and pad/truncate
91
+ emb = reps[idx, 1:1+L, :] # Remove CLS and EOS tokens
92
+ if L < MAX_SEQ_LEN:
93
+ pad_len = MAX_SEQ_LEN - L
94
+ emb = F.pad(emb, (0, 0, 0, pad_len))
95
+ elif L > MAX_SEQ_LEN:
96
+ emb = emb[:MAX_SEQ_LEN, :]
97
+ embeddings[sid] = emb.cpu()
98
+
99
+ return embeddings
100
+
101
+ def save_embeddings_for_compressor(embeddings, output_dir="/data2/edwardsun/flow_project/peptide_embeddings"):
102
+ """
103
+ Save embeddings in a format compatible with the compressor.
104
+ Creates both individual files and a combined tensor.
105
+ """
106
+ os.makedirs(output_dir, exist_ok=True)
107
+
108
+ # Save individual embeddings
109
+ print(f"Saving individual embeddings to {output_dir}/...")
110
+ for seq_id, emb in tqdm(embeddings.items(), desc="Saving individual files"):
111
+ torch.save(emb, os.path.join(output_dir, f"{seq_id}.pt"))
112
+
113
+ # Create and save combined tensor for compressor
114
+ print("Creating combined tensor...")
115
+ all_embeddings = []
116
+ seq_ids = []
117
+
118
+ for seq_id, emb in embeddings.items():
119
+ all_embeddings.append(emb)
120
+ seq_ids.append(seq_id)
121
+
122
+ # Stack all embeddings
123
+ combined_embeddings = torch.stack(all_embeddings) # [N, MAX_SEQ_LEN, D]
124
+
125
+ # Save combined tensor
126
+ combined_path = os.path.join(output_dir, "all_peptide_embeddings.pt")
127
+ torch.save(combined_embeddings, combined_path)
128
+
129
+ # Save sequence IDs for reference
130
+ seq_ids_path = os.path.join(output_dir, "sequence_ids.json")
131
+ with open(seq_ids_path, 'w') as f:
132
+ json.dump(seq_ids, f, indent=2)
133
+
134
+ # Save metadata
135
+ metadata = {
136
+ "num_sequences": len(embeddings),
137
+ "embedding_dim": combined_embeddings.shape[-1],
138
+ "max_seq_len": MAX_SEQ_LEN,
139
+ "device_used": str(DEVICE),
140
+ "model_name": "esm2_t33_650M_UR50D"
141
+ }
142
+ metadata_path = os.path.join(output_dir, "metadata.json")
143
+ with open(metadata_path, 'w') as f:
144
+ json.dump(metadata, f, indent=2)
145
+
146
+ print(f"Saved combined embeddings: {combined_path}")
147
+ print(f"Combined tensor shape: {combined_embeddings.shape}")
148
+ print(f"Memory usage: {combined_embeddings.element_size() * combined_embeddings.nelement() / 1e6:.1f} MB")
149
+
150
+ return combined_path
151
+
152
+ def create_compressor_dataset(embeddings, output_dir="/data2/edwardsun/flow_project/compressor_dataset"):
153
+ """
154
+ Create a dataset format specifically for the compressor training.
155
+ """
156
+ os.makedirs(output_dir, exist_ok=True)
157
+
158
+ # Stack all embeddings
159
+ all_embeddings = torch.stack(list(embeddings.values()))
160
+
161
+ # Save as numpy array for easy loading
162
+ np_path = os.path.join(output_dir, "peptide_embeddings.npy")
163
+ np.save(np_path, all_embeddings.numpy())
164
+
165
+ # Save as torch tensor
166
+ torch_path = os.path.join(output_dir, "peptide_embeddings.pt")
167
+ torch.save(all_embeddings, torch_path)
168
+
169
+ print(f"Created compressor dataset:")
170
+ print(f" Shape: {all_embeddings.shape}")
171
+ print(f" Numpy: {np_path}")
172
+ print(f" Torch: {torch_path}")
173
+
174
+ return torch_path
175
+
176
+ # ---------------- Main Execution ----------------
177
+ if __name__ == '__main__':
178
+ # 1. Load model & tokenizer
179
+ print("Loading ESM-2 model...")
180
+ model_name = 'esm2_t33_650M_UR50D'
181
+ model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
182
+ model = model.to(DEVICE)
183
+ print(f"Loaded {model_name}")
184
+
185
+ # 2. Read and filter sequences from peptides JSON
186
+ json_file = 'all_peptides_data.json'
187
+ sequences = read_peptides_json(json_file)
188
+ print(f"Loaded {len(sequences)} valid sequences from {json_file}")
189
+
190
+ if len(sequences) == 0:
191
+ print("No valid sequences found. Exiting.")
192
+ exit(1)
193
+
194
+ # 3. Compute per-residue embeddings
195
+ embeddings = get_per_residue_embeddings(model, alphabet, sequences)
196
+
197
+ # 4. Save embeddings in multiple formats
198
+ print("\nSaving embeddings...")
199
+
200
+ # Save individual files and combined tensor
201
+ combined_path = save_embeddings_for_compressor(embeddings)
202
+
203
+ # Create compressor-specific dataset
204
+ compressor_path = create_compressor_dataset(embeddings)
205
+
206
+ print(f"\n✓ Successfully processed {len(embeddings)} peptide sequences")
207
+ print(f"✓ Embeddings saved and ready for compressor training")
208
+ print(f"✓ Use '{compressor_path}' in your compressor.py file")
209
+
210
+ # Show some statistics
211
+ sample_emb = next(iter(embeddings.values()))
212
+ print(f"\nEmbedding statistics:")
213
+ print(f" Individual embedding shape: {sample_emb.shape}")
214
+ print(f" Embedding dimension: {sample_emb.shape[-1]}")
215
+ print(f" Data type: {sample_emb.dtype}")
generate_amps.py ADDED
@@ -0,0 +1,215 @@
1
+ import torch
2
+ import torch.nn.functional as F
3
+ import numpy as np
4
+ from tqdm import tqdm
5
+ import os
6
+ from datetime import datetime
7
+
8
+ # Import your components
9
+ from compressor_with_embeddings import Compressor, Decompressor
10
+ from final_flow_model import AMPFlowMatcherCFGConcat, AMPProtFlowPipelineCFG
11
+
12
+ class AMPGenerator:
13
+ """
14
+ Generate AMP samples using trained ProtFlow model.
15
+ """
16
+
17
+ def __init__(self, model_path, device='cuda'):
18
+ self.device = device
19
+
20
+ # Load models
21
+ self._load_models(model_path)
22
+
23
+ # Load preprocessing statistics
24
+ self.stats = torch.load('normalization_stats.pt', map_location=device)
25
+
26
+ def _load_models(self, model_path):
27
+ """Load trained models."""
28
+ print("Loading trained models...")
29
+
30
+ # Load compressor and decompressor
31
+ self.compressor = Compressor().to(self.device)
32
+ self.decompressor = Decompressor().to(self.device)
33
+
34
+ self.compressor.load_state_dict(torch.load('/data2/edwardsun/flow_amp/models/final_compressor_model.pth', map_location=self.device))
35
+ self.decompressor.load_state_dict(torch.load('/data2/edwardsun/flow_amp/models/final_decompressor_model.pth', map_location=self.device))
36
+
37
+ # Load flow matching model with CFG
38
+ self.flow_model = AMPFlowMatcherCFGConcat(
39
+ hidden_dim=480,
40
+ compressed_dim=80, # 1280 // 16
41
+ n_layers=12,
42
+ n_heads=16,
43
+ dim_ff=3072,
44
+ max_seq_len=25,
45
+ use_cfg=True
46
+ ).to(self.device)
47
+
48
+ checkpoint = torch.load(model_path, map_location=self.device)
49
+
50
+ # Handle PyTorch compilation wrapper
51
+ state_dict = checkpoint['flow_model_state_dict']
52
+ new_state_dict = {}
53
+
54
+ for key, value in state_dict.items():
55
+ # Remove _orig_mod prefix if present
56
+ if key.startswith('_orig_mod.'):
57
+ new_key = key[10:] # Remove '_orig_mod.' prefix
58
+ else:
59
+ new_key = key
60
+ new_state_dict[new_key] = value
61
+
62
+ self.flow_model.load_state_dict(new_state_dict)
63
+
64
+ print(f"✓ All models loaded successfully from step {checkpoint['step']}!")
65
+ print(f" Loss at checkpoint: {checkpoint['loss']:.6f}")
66
+
67
+ def generate_amps(self, num_samples=100, num_steps=25, batch_size=32, cfg_scale=7.5):
68
+ """
69
+ Generate AMP samples using flow matching with CFG.
70
+
71
+ Args:
72
+ num_samples: Number of AMP samples to generate
73
+ num_steps: Number of ODE solving steps (25 for good quality, 1 for reflow)
74
+ batch_size: Batch size for generation
75
+ cfg_scale: CFG guidance scale (higher = stronger conditioning)
76
+ """
77
+ print(f"Generating {num_samples} AMP samples with {num_steps} steps (CFG scale: {cfg_scale})...")
78
+
79
+ self.flow_model.eval()
80
+ self.compressor.eval()
81
+ self.decompressor.eval()
82
+
83
+ all_generated = []
84
+
85
+ with torch.no_grad():
86
+ for i in tqdm(range(0, num_samples, batch_size), desc="Generating"):
87
+ current_batch = min(batch_size, num_samples - i)
88
+
89
+ # Sample random noise
90
+ eps = torch.randn(current_batch, 25, 80, device=self.device) # [B, L', COMP_DIM]
91
+
92
+ # ODE solving steps with CFG
93
+ xt = eps.clone()
94
+ amp_labels = torch.full((current_batch,), 0, device=self.device) # 0 = AMP
95
+ mask_labels = torch.full((current_batch,), 2, device=self.device) # 2 = Mask
96
+
97
+ for step in range(num_steps):
98
+ t = torch.ones(current_batch, device=self.device) * (1.0 - step/num_steps)
99
+
100
+ # CFG: Generate with condition and without condition
101
+ if cfg_scale > 0:
102
+ # With AMP condition
103
+ vt_cond = self.flow_model(xt, t, labels=amp_labels)
104
+
105
+ # Without condition (mask)
106
+ vt_uncond = self.flow_model(xt, t, labels=mask_labels)
107
+
108
+ # CFG interpolation
109
+ vt = vt_uncond + cfg_scale * (vt_cond - vt_uncond)
110
+ else:
111
+ # No CFG, use mask label
112
+ vt = self.flow_model(xt, t, labels=mask_labels)
113
+
114
+ # Euler step for backward integration (t: 1 -> 0)
115
+ # Use negative dt to integrate backward from noise to data
116
+ dt = -1.0 / num_steps
117
+ xt = xt + vt * dt
118
+
119
+ # Decompress to get embeddings
120
+ decompressed = self.decompressor(xt) # [B, L, ESM_DIM]
121
+
122
+ # Apply reverse preprocessing
123
+ m, s, mn, mx = self.stats['mean'], self.stats['std'], self.stats['min'], self.stats['max']
124
+ decompressed = decompressed * (mx - mn + 1e-8) + mn
125
+ decompressed = decompressed * s + m
126
+
127
+ all_generated.append(decompressed.cpu())
128
+
129
+ # Concatenate all batches
130
+ generated_embeddings = torch.cat(all_generated, dim=0)
131
+
132
+ print(f"✓ Generated {generated_embeddings.shape[0]} AMP embeddings")
133
+ print(f" Shape: {generated_embeddings.shape}")
134
+ print(f" Stats - Mean: {generated_embeddings.mean():.4f}, Std: {generated_embeddings.std():.4f}")
135
+
136
+ return generated_embeddings
137
+
138
+ def generate_with_reflow(self, num_samples=100):
139
+ """
140
+ Generate AMP samples using 1-step reflow (if you have reflow model).
141
+ """
142
+ print(f"Generating {num_samples} AMP samples with 1-step reflow...")
143
+
144
+ # This would use the reflow implementation
145
+ # For now, just use 1-step generation
146
+ return self.generate_amps(num_samples=num_samples, num_steps=1, batch_size=32)
147
+
148
+ def main():
149
+ """Main generation function."""
150
+ print("=== AMP Generation Pipeline with CFG ===")
151
+
152
+ # Use the best model from training
153
+ model_path = '/data2/edwardsun/flow_amp/checkpoints/amp_flow_model_best_optimized.pth'
154
+
155
+ # Check if checkpoint exists
156
+ try:
157
+ checkpoint = torch.load(model_path, map_location='cpu')
158
+ print(f"✓ Found best model at step {checkpoint['step']} with loss {checkpoint['loss']:.6f}")
159
+ print(f" Global step: {checkpoint['global_step']}")
160
+ print(f" Total samples: {checkpoint['total_samples']:,}")
161
+ except:
162
+ print(f"❌ Best model not found: {model_path}")
163
+ print("Please train the flow matching model first using amp_flow_training.py")
164
+ return
165
+
166
+ # Initialize generator
167
+ generator = AMPGenerator(model_path, device='cuda')
168
+
169
+ # Generate samples with different CFG scales
170
+ print("\n1. Generating with CFG scale 0.0 (no conditioning)...")
171
+ samples_no_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=0.0)
172
+
173
+ print("\n2. Generating with CFG scale 3.0 (weak conditioning)...")
174
+ samples_weak_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=3.0)
175
+
176
+ print("\n3. Generating with CFG scale 7.5 (strong conditioning)...")
177
+ samples_strong_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=7.5)
178
+
179
+ print("\n4. Generating with CFG scale 15.0 (very strong conditioning)...")
180
+ samples_very_strong_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=15.0)
181
+
182
+ # Create output directory if it doesn't exist
183
+ output_dir = '/data2/edwardsun/generated_samples'
184
+ os.makedirs(output_dir, exist_ok=True)
185
+
186
+ # Get today's date for filename
187
+ today = datetime.now().strftime('%Y%m%d')
188
+
189
+ # Save generated samples with date
190
+ torch.save(samples_no_cfg, os.path.join(output_dir, f'generated_amps_best_model_no_cfg_{today}.pt'))
191
+ torch.save(samples_weak_cfg, os.path.join(output_dir, f'generated_amps_best_model_weak_cfg_{today}.pt'))
192
+ torch.save(samples_strong_cfg, os.path.join(output_dir, f'generated_amps_best_model_strong_cfg_{today}.pt'))
193
+ torch.save(samples_very_strong_cfg, os.path.join(output_dir, f'generated_amps_best_model_very_strong_cfg_{today}.pt'))
194
+
195
+ print("\n✓ Generation complete!")
196
+ print(f"Generated samples saved (Date: {today}):")
197
+ print(f" - generated_amps_best_model_no_cfg_{today}.pt (no conditioning)")
198
+ print(f" - generated_amps_best_model_weak_cfg_{today}.pt (weak CFG)")
199
+ print(f" - generated_amps_best_model_strong_cfg_{today}.pt (strong CFG)")
200
+ print(f" - generated_amps_best_model_very_strong_cfg_{today}.pt (very strong CFG)")
201
+
202
+ print("\nCFG Analysis:")
203
+ print(" - CFG scale 0.0: No conditioning, generates diverse sequences")
204
+ print(" - CFG scale 3.0: Weak AMP conditioning")
205
+ print(" - CFG scale 7.5: Strong AMP conditioning (recommended)")
206
+ print(" - CFG scale 15.0: Very strong AMP conditioning (may be too restrictive)")
207
+
208
+ print("\nNext steps:")
209
+ print("1. Decode embeddings back to sequences using ESM-2 decoder")
210
+ print("2. Evaluate AMP properties (antimicrobial activity, toxicity)")
211
+ print("3. Compare sequences generated with different CFG scales")
212
+ print("4. Implement conditioning for specific properties")
213
+
214
+ if __name__ == "__main__":
215
+ main()
launch_full_data_training.sh ADDED
@@ -0,0 +1,118 @@
1
+ #!/bin/bash
2
+
3
+ # Optimized Single GPU AMP Flow Matching Training Launch Script with FULL DATA
4
+ # This script launches optimized training on GPU 3 using ALL available data
5
+ # Features: Mixed precision (BF16), increased batch size, H100 optimizations
6
+
7
+ echo "=== Launching Optimized Single GPU AMP Flow Matching Training with FULL DATA ==="
8
+ echo "Using GPU 3 for training (other GPUs are busy)"
9
+ echo "Using ALL available peptide embeddings and UniProt data"
10
+ echo "OVERNIGHT TRAINING: 6,000 iterations with CFG support and H100 optimizations"
11
+ echo ""
12
+
13
+ # Check if required files exist
14
+ echo "Checking required files..."
15
+ if [ ! -f "final_compressor_model.pth" ]; then
16
+ echo "❌ Missing final_compressor_model.pth"
17
+ echo "Please run compressor_with_embeddings.py first"
18
+ exit 1
19
+ fi
20
+
21
+ if [ ! -f "final_decompressor_model.pth" ]; then
22
+ echo "❌ Missing final_decompressor_model.pth"
23
+ echo "Please run compressor_with_embeddings.py first"
24
+ exit 1
25
+ fi
26
+
27
+ if [ ! -d "/data2/edwardsun/flow_project/peptide_embeddings/" ]; then
28
+ echo "❌ Missing /data2/edwardsun/flow_project/peptide_embeddings/ directory"
29
+ echo "Please run final_sequence_encoder.py first"
30
+ exit 1
31
+ fi
32
+
33
+ # Check for full data files
34
+ if [ ! -f "/data2/edwardsun/flow_project/peptide_embeddings/all_peptide_embeddings.pt" ]; then
35
+ echo "⚠️ Warning: all_peptide_embeddings.pt not found"
36
+ echo "Will use individual embedding files instead"
37
+ else
38
+ echo "✓ Found all_peptide_embeddings.pt (4.3GB - ALL peptide data)"
39
+ fi
40
+
41
+ if [ ! -f "/data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json" ]; then
42
+ echo "❌ Missing /data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json"
43
+ echo "This contains ALL UniProt data for CFG training"
44
+ exit 1
45
+ else
46
+ echo "✓ Found uniprot_processed_data.json (3.4GB - ALL UniProt data)"
47
+ fi
48
+
49
+ echo "✓ All required files found!"
50
+ echo ""
51
+
52
+ # Set CUDA device to GPU 3
53
+ export CUDA_VISIBLE_DEVICES=3
54
+
55
+ # Enable H100 optimizations
56
+ export TORCH_CUDNN_V8_API_ENABLED=1
57
+ export TORCH_CUDNN_V8_API_DISABLED=0
58
+
59
+ echo "=== Optimized Training Configuration ==="
60
+ echo " - GPU: 3 (CUDA_VISIBLE_DEVICES=3)"
61
+ echo " - Batch size: 96 (optimized based on profiling)"
62
+ echo " - Total iterations: 6,000"
63
+ echo " - Mixed precision: BF16 (H100 optimized)"
64
+ echo " - Learning rate: 4e-4 -> 2e-4 (cosine annealing)"
65
+ echo " - Warmup steps: 5,000"
66
+ echo " - Gradient clipping: 1.0"
67
+ echo " - Weight decay: 0.01"
68
+ echo " - Data workers: 16"
69
+ echo " - CFG dropout: 15%"
70
+ echo " - Validation: Every 10,000 steps"
71
+ echo " - Checkpoints: Every 1,000 steps"
72
+ echo " - Estimated time: ~8-10 hours (overnight training)"
73
+ echo ""
74
+
75
+ # Check GPU memory and capabilities
76
+ echo "Checking GPU capabilities..."
77
+ nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits | while IFS=, read -r name total free; do
78
+ echo " GPU: $name"
79
+ echo " Total memory: ${total}MB"
80
+ echo " Free memory: ${free}MB"
81
+ echo " Available: $((free * 100 / total))%"
82
+ done
83
+
84
+ echo ""
85
+
86
+ # Launch optimized training
87
+ echo "Starting optimized single GPU training on GPU 3 with FULL DATA..."
88
+ echo ""
89
+
90
+ # Launch training with optional wandb logging
91
+ # Uncomment the following line if you want to use wandb logging:
92
+ # python amp_flow_training_single_gpu_full_data.py --use_wandb
93
+
94
+ # Standard training without wandb
95
+ python amp_flow_training_single_gpu_full_data.py
96
+
97
+ echo ""
98
+ echo "=== Optimized Overnight Training Complete with FULL DATA ==="
99
+ echo "Check for output files:"
100
+ echo " - amp_flow_model_best_optimized.pth (best validation model)"
101
+ echo " - amp_flow_model_final_optimized.pth (final model)"
102
+ echo " - amp_flow_checkpoint_optimized_step_*.pth (checkpoints every 1,000 steps)"
103
+ echo ""
104
+ echo "Training optimizations applied:"
105
+ echo " ✓ Mixed precision (BF16) for ~30-50% speedup"
106
+ echo " ✓ Increased batch size (96) for better H100 utilization"
107
+ echo " ✓ Optimized learning rate schedule with proper warmup"
108
+ echo " ✓ Gradient clipping for training stability"
109
+ echo " ✓ CFG dropout for better guidance"
110
+ echo " ✓ Validation monitoring and early stopping"
111
+ echo " ✓ PyTorch 2.x compilation for speedup"
112
+ echo ""
113
+ echo "Next steps:"
114
+ echo "1. Test the optimized model: python generate_amps.py"
115
+ echo "2. Compare performance with previous model"
116
+ echo "3. Implement reflow for 1-step generation"
117
+ echo "4. Add conditioning for toxicity"
118
+ echo "5. Fine-tune on specific AMP properties"
launch_multi_gpu_training.sh ADDED
@@ -0,0 +1,85 @@
1
+ #!/bin/bash
2
+
3
+ # Multi-GPU AMP Flow Matching Training Launch Script
4
+ # This script launches distributed training across 4 H100 GPUs
5
+
6
+ echo "=== Launching Multi-GPU AMP Flow Matching Training with FULL DATA ==="
7
+ echo "Using 4 H100 GPUs for distributed training"
8
+ echo "Using ALL available peptide embeddings and UniProt data"
9
+ echo "EXTENDED TRAINING: 5000 iterations with CFG support"
10
+ echo ""
11
+
12
+ # Check if required files exist
13
+ echo "Checking required files..."
14
+ if [ ! -f "final_compressor_model.pth" ]; then
15
+ echo "❌ Missing final_compressor_model.pth"
16
+ echo "Please run compressor_with_embeddings.py first"
17
+ exit 1
18
+ fi
19
+
20
+ if [ ! -f "final_decompressor_model.pth" ]; then
21
+ echo "❌ Missing final_decompressor_model.pth"
22
+ echo "Please run compressor_with_embeddings.py first"
23
+ exit 1
24
+ fi
25
+
26
+ if [ ! -d "/data2/edwardsun/flow_project/peptide_embeddings/" ]; then
27
+ echo "❌ Missing /data2/edwardsun/flow_project/peptide_embeddings/ directory"
28
+ echo "Please run final_sequence_encoder.py first"
29
+ exit 1
30
+ fi
31
+
32
+ # Check for full data files
33
+ if [ ! -f "/data2/edwardsun/flow_project/peptide_embeddings/all_peptide_embeddings.pt" ]; then
34
+ echo "⚠️ Warning: all_peptide_embeddings.pt not found"
35
+ echo "Will use individual embedding files instead"
36
+ else
37
+ echo "✓ Found all_peptide_embeddings.pt (4.3GB - ALL peptide data)"
38
+ fi
39
+
40
+ # Check if there are embedding files in the directory (fallback)
41
+ if [ ! "$(ls -A /data2/edwardsun/flow_project/peptide_embeddings/*.pt 2>/dev/null)" ]; then
42
+ echo "❌ No .pt files found in /data2/edwardsun/flow_project/peptide_embeddings/ directory"
43
+ echo "Please run final_sequence_encoder.py first"
44
+ exit 1
45
+ fi
46
+
47
+ echo "✓ All required files found!"
48
+ echo ""
49
+
50
+ # Set environment variables for distributed training
51
+ export NCCL_DEBUG=INFO
52
+ export NCCL_IB_DISABLE=0
53
+ export NCCL_P2P_DISABLE=0
54
+
55
+ # Launch distributed training
56
+ echo "Starting distributed training with torchrun..."
57
+ echo "Configuration (FULL DATA TRAINING):"
58
+ echo " - Number of GPUs: 4"
59
+ echo " - Batch size per GPU: 64"
60
+ echo " - Total batch size: 256"
61
+ echo " - Total iterations: 5,000"
62
+ echo " - Data: ALL peptide embeddings + ALL UniProt data"
63
+ echo " - Estimated time: ~30-45 minutes (4x faster than single GPU)"
64
+ echo ""
65
+
66
+ # Launch with torchrun
67
+ torchrun \
68
+ --nproc_per_node=4 \
69
+ --nnodes=1 \
70
+ --node_rank=0 \
71
+ --master_addr=localhost \
72
+ --master_port=29500 \
73
+ amp_flow_training_multi_gpu.py
74
+
75
+ echo ""
76
+ echo "=== Training Complete with FULL DATA ==="
77
+ echo "Check for output files:"
78
+ echo " - amp_flow_model_final_full_data.pth (final model with full data)"
79
+ echo " - amp_flow_checkpoint_full_data_step_*.pth (checkpoints)"
80
+ echo ""
81
+ echo "Next steps:"
82
+ echo "1. Test the model: python generate_amps.py"
83
+ echo "2. If successful, increase iterations for full training"
84
+ echo "3. Implement reflow for 1-step generation"
85
+ echo "4. Add conditioning for toxicity"
model_card.md ADDED
@@ -0,0 +1,127 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - protein-design
6
+ - antimicrobial-peptides
7
+ - flow-matching
8
+ - esm-2
9
+ - pytorch
10
+ license: mit
11
+ datasets:
12
+ - uniprot
13
+ - amp-datasets
14
+ metrics:
15
+ - mic-prediction
16
+ - sequence-validity
17
+ - diversity
18
+ ---
19
+
20
+ # FlowAMP: Flow-based Antimicrobial Peptide Generation
21
+
22
+ ## Model Description
23
+
24
+ FlowAMP is a novel flow-based generative model for designing antimicrobial peptides (AMPs) using conditional flow matching and ESM-2 protein language model embeddings. The model leverages the power of flow matching for high-quality peptide generation while incorporating protein language model understanding for biologically relevant sequences.
25
+
26
+ ### Architecture
27
+
28
+ The model consists of several key components:
29
+
30
+ 1. **ESM-2 Encoder**: Uses ESM-2 (esm2_t33_650M_UR50D) to extract 1280-dimensional protein sequence embeddings
31
+ 2. **Compressor/Decompressor**: Reduces embedding dimensionality by 16x (1280 → 80) for efficient processing
32
+ 3. **Flow Matcher**: Implements conditional flow matching for generation with time embeddings
33
+ 4. **CFG Integration**: Classifier-free guidance for controllable generation
34
+
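+ A rough sketch of the tensor shapes through the pipeline (a summary of the components above, with dimension values taken from the generation scripts, not an exact API):
+
+ ```python
+ # peptide sequence (up to 50 residues)
+ #   -> ESM-2 layer-33 embeddings        [50, 1280]
+ #   -> compressor                       [25, 80]   (length halved by pooling, 16x channel reduction)
+ #   -> conditional flow matching (CFG)  velocity field over the compressed latent
+ #   -> decompressor                     [50, 1280]
+ #   -> nearest-token / diverse decoding amino-acid string
+ ```
+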
35
+ ### Key Features
36
+
37
+ - **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
38
+ - **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
39
+ - **CFG Training**: Implements Classifier-Free Guidance for controllable generation
40
+ - **Multi-GPU Training**: Optimized for H100 GPUs with mixed precision training
41
+ - **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment
42
+
43
+ ## Training
44
+
45
+ ### Training Data
46
+
47
+ The model was trained on:
48
+ - **UniProt Database**: Comprehensive protein sequence database
49
+ - **AMP Datasets**: Curated antimicrobial peptide sequences
50
+ - **ESM-2 Embeddings**: Pre-computed embeddings for efficient training
51
+
52
+ ### Training Configuration
53
+
54
+ - **Batch Size**: 96 (optimized for H100)
55
+ - **Learning Rate**: 4e-4 with cosine annealing to 2e-4
56
+ - **Training Iterations**: 6,000
57
+ - **Mixed Precision**: BF16 for H100 optimization
58
+ - **CFG Dropout**: 15% for unconditional training
59
+ - **Gradient Clipping**: Norm=1.0 for stability
60
+
61
+ ### Training Performance
62
+
63
+ - **Speed**: 31 steps/second on H100 GPU
64
+ - **Memory Efficiency**: Mixed precision training
65
+ - **Stability**: Gradient clipping and weight decay (0.01)
66
+
67
+ ## Usage
68
+
69
+ ### Basic Generation
70
+
71
+ ```python
+ from generate_amps import AMPGenerator
+
+ # Load the trained pipeline (flow model, compressor/decompressor, normalization stats)
+ generator = AMPGenerator('path/to/checkpoint.pth', device='cuda')
+
+ # Generate AMP embeddings at different CFG guidance scales
+ samples_no_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=0.0)
+ samples_weak_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=3.0)
+ samples_strong_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=7.5)
+ samples_very_strong_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=15.0)
+ ```
84
+
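+ The generator returns ESM-2 embeddings rather than amino-acid strings; `EmbeddingToSequenceConverter` from `final_sequence_decoder.py` converts them back to sequences, for example:
+
+ ```python
+ from final_sequence_decoder import EmbeddingToSequenceConverter
+
+ converter = EmbeddingToSequenceConverter(device='cuda')
+ sequences = converter.batch_embedding_to_sequences(samples_strong_cfg, method='diverse', temperature=0.5)
+ valid_sequences = converter.filter_valid_sequences(sequences)
+ ```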
85
+ ### Evaluation
86
+
87
+ ```python
+ from test_generated_peptides import PeptideTester
+
+ # Generate peptides and score them with the local APEX MIC predictor
+ tester = PeptideTester(model_path='path/to/checkpoint.pth', device='cuda')
+ results = tester.run_full_pipeline(num_samples=100)
+ ```
93
+
94
+ ## Performance
95
+
96
+ ### Generation Quality
97
+
98
+ - **Sequence Validity**: High percentage of valid peptide sequences
99
+ - **Diversity**: Good sequence diversity across different CFG strengths
100
+ - **Biological Relevance**: ESM-2 embeddings ensure biologically meaningful sequences
101
+
102
+ ### Antimicrobial Activity
103
+
104
+ - **MIC Prediction**: Integration with Apex model for MIC prediction
105
+ - **Activity Assessment**: Comprehensive evaluation of antimicrobial potential
106
+ - **CFG Effectiveness**: Measured through controlled generation
107
+
108
+ ## Limitations
109
+
110
+ - **Sequence Length**: Limited to 50 amino acids maximum
111
+ - **Computational Requirements**: Requires GPU for efficient generation
112
+ - **Training Data**: Dependent on quality of UniProt and AMP datasets
113
+
114
+ ## Citation
115
+
116
+ ```bibtex
117
+ @article{flowamp2024,
118
+ title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
119
+ author={Sun, Edward},
120
+ journal={arXiv preprint},
121
+ year={2024}
122
+ }
123
+ ```
124
+
125
+ ## License
126
+
127
+ MIT License - see LICENSE file for details.
monitor_training.sh ADDED
@@ -0,0 +1,53 @@
1
+ #!/bin/bash
2
+
3
+ echo "=== AMP Flow Training Monitor ==="
4
+ echo "Timestamp: $(date)"
5
+ echo ""
6
+
7
+ # Check if training process is running
8
+ echo "1. Process Status:"
9
+ if pgrep -f "amp_flow_training_single_gpu_full_data.py" > /dev/null; then
10
+ echo "✓ Training process is running"
11
+ PID=$(pgrep -f "amp_flow_training_single_gpu_full_data.py")
12
+ echo " PID: $PID"
13
+ echo " Runtime: $(ps -o etime= -p $PID)"
14
+ else
15
+ echo "❌ Training process not found"
16
+ exit 1
17
+ fi
18
+
19
+ echo ""
20
+
21
+ # Check GPU usage
22
+ echo "2. GPU Usage:"
23
+ nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | while IFS=, read -r idx name util mem_used mem_total; do
24
+ echo " GPU $idx ($name): $util% | ${mem_used}MB/${mem_total}MB"
25
+ done
26
+
27
+ echo ""
28
+
29
+ # Check log file
30
+ echo "3. Recent Log Output:"
31
+ if [ -f "overnight_training.log" ]; then
32
+ echo " Log file size: $(du -h overnight_training.log | cut -f1)"
33
+ echo " Last 5 lines:"
34
+ tail -5 overnight_training.log | sed 's/^/ /'
35
+ else
36
+ echo " ❌ Log file not found"
37
+ fi
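+ # Note: overnight_training.log only exists if the launcher was started with its output redirected,
+ # e.g. (illustrative): nohup bash launch_full_data_training.sh > overnight_training.log 2>&1 &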
38
+
39
+ echo ""
40
+
41
+ # Check for checkpoint files
42
+ echo "4. Checkpoint Files:"
43
+ if [ -d "/data2/edwardsun/flow_checkpoints" ]; then
44
+ echo " Checkpoint directory: /data2/edwardsun/flow_checkpoints"
45
+ ls -la /data2/edwardsun/flow_checkpoints/*.pth 2>/dev/null | wc -l | xargs echo " Number of checkpoints:"
46
+ echo " Latest checkpoint:"
47
+ ls -t /data2/edwardsun/flow_checkpoints/*.pth 2>/dev/null | head -1 | xargs -I {} basename {} 2>/dev/null || echo " None yet"
48
+ else
49
+ echo " ❌ Checkpoint directory not found"
50
+ fi
51
+
52
+ echo ""
53
+ echo "=== End Monitor ==="
normalization_stats.pt ADDED
Binary file (12.3 kB).
requirements.txt ADDED
@@ -0,0 +1,9 @@
1
+ torch>=2.0.0
+ fair-esm>=2.0.0
2
+ transformers>=4.20.0
3
+ numpy>=1.21.0
4
+ tqdm>=4.64.0
5
+ wandb>=0.13.0
6
+ pandas>=1.5.0
7
+ scikit-learn>=1.1.0
8
+ matplotlib>=3.5.0
9
+ seaborn>=0.11.0
requirements.yaml ADDED
@@ -0,0 +1,31 @@
1
+ name: mdlm
2
+ channels:
3
+ - pytorch
4
+ - conda-forge
5
+ - defaults
6
+ dependencies:
7
+ - python=3.9
8
+ - pytorch=2.1.0
9
+ - torchvision
10
+ - torchaudio
11
+ - pytorch-cuda=11.8
12
+ - cudatoolkit=11.8
13
+ - pip
14
+ - pip:
15
+ - fair-esm
16
+ - transformers
17
+ - datasets
18
+ - accelerate
19
+ - wandb
20
+ - tqdm
21
+ - numpy
22
+ - scipy
23
+ - scikit-learn
24
+ - matplotlib
25
+ - seaborn
26
+ - pandas
27
+ - biopython
28
+ - h5py
29
+ - tensorboard
30
+ - jupyter
31
+ - ipykernel
test_generated_peptides.py ADDED
@@ -0,0 +1,383 @@
1
+ import torch
2
+ import numpy as np
3
+ import json
4
+ import os
5
+ from tqdm import tqdm
6
+ import warnings
7
+ from datetime import datetime
8
+ warnings.filterwarnings('ignore')
9
+
10
+ # Import our components
11
+ from generate_amps import AMPGenerator
12
+ from compressor_with_embeddings import Compressor, Decompressor
13
+ from final_sequence_decoder import EmbeddingToSequenceConverter
14
+
15
+ # Import local APEX wrapper
16
+ try:
17
+ from local_apex_wrapper import LocalAPEXWrapper
18
+ APEX_AVAILABLE = True
19
+ except ImportError as e:
20
+ print(f"Warning: Local APEX not available: {e}")
21
+ APEX_AVAILABLE = False
22
+
23
+ class PeptideTester:
24
+ """
25
+ Generate peptides and test them using APEX for antimicrobial activity.
26
+ """
27
+
28
+ def __init__(self, model_path='amp_flow_model_final.pth', device='cuda'):
29
+ self.device = device
30
+ self.model_path = model_path
31
+
32
+ # Initialize generator
33
+ print("Initializing peptide generator...")
34
+ self.generator = AMPGenerator(model_path, device)
35
+
36
+ # Initialize embedding to sequence converter
37
+ print("Initializing embedding to sequence converter...")
38
+ self.converter = EmbeddingToSequenceConverter(device)
39
+
40
+ # Initialize APEX if available
41
+ if APEX_AVAILABLE:
42
+ print("Initializing local APEX predictor...")
43
+ self.apex = LocalAPEXWrapper()
44
+ print("✓ Local APEX loaded successfully!")
45
+ else:
46
+ self.apex = None
47
+ print("⚠ Local APEX not available - will only generate sequences")
48
+
49
+ def generate_peptides(self, num_samples=100, num_steps=25, batch_size=32):
50
+ """
51
+ Generate peptide sequences using the trained flow model.
52
+ """
53
+ print(f"\n=== Generating {num_samples} Peptide Sequences ===")
54
+
55
+ # Generate embeddings
56
+ generated_embeddings = self.generator.generate_amps(
57
+ num_samples=num_samples,
58
+ num_steps=num_steps,
59
+ batch_size=batch_size
60
+ )
61
+
62
+ print(f"Generated embeddings shape: {generated_embeddings.shape}")
63
+
64
+ # Convert embeddings to sequences using the converter
65
+ sequences = self.converter.batch_embedding_to_sequences(generated_embeddings)
66
+
67
+ # Filter valid sequences
68
+ sequences = self.converter.filter_valid_sequences(sequences)
69
+
70
+ return sequences
71
+
72
+
73
+
74
+ def test_with_apex(self, sequences):
75
+ """
76
+ Test generated sequences using APEX for antimicrobial activity.
77
+ """
78
+ if not APEX_AVAILABLE:
79
+ print("⚠ APEX not available - skipping activity prediction")
80
+ return None
81
+
82
+ print(f"\n=== Testing {len(sequences)} Sequences with APEX ===")
83
+
84
+ results = []
85
+
86
+ for i, seq in tqdm(enumerate(sequences), desc="Testing with APEX"):
87
+ try:
88
+ # Predict antimicrobial activity using local APEX
89
+ avg_mic = self.apex.predict_single(seq)
90
+ is_amp = self.apex.is_amp(seq, threshold=32.0) # MIC threshold
91
+
92
+ result = {
93
+ 'sequence': seq,
94
+ 'sequence_id': f'generated_{i:04d}',
95
+ 'apex_score': avg_mic, # Lower MIC = better activity
96
+ 'is_amp': is_amp,
97
+ 'length': len(seq)
98
+ }
99
+ results.append(result)
100
+
101
+ except Exception as e:
102
+ print(f"Error testing sequence {i}: {e}")
103
+ continue
104
+
105
+ return results
106
+
107
+ def analyze_results(self, results):
108
+ """
109
+ Analyze the results of APEX testing.
110
+ """
111
+ if not results:
112
+ print("No results to analyze")
113
+ return
114
+
115
+ print(f"\n=== Analysis of {len(results)} Generated Peptides ===")
116
+
117
+ # Extract scores
118
+ scores = [r['apex_score'] for r in results]
119
+ amp_count = sum(1 for r in results if r['is_amp'])
120
+
121
+ print(f"Total sequences tested: {len(results)}")
122
+ print(f"Predicted AMPs: {amp_count} ({amp_count/len(results)*100:.1f}%)")
123
+ print(f"Average MIC: {np.mean(scores):.2f} μg/mL")
124
+ print(f"MIC range: {np.min(scores):.2f} - {np.max(scores):.2f} μg/mL")
125
+ print(f"MIC std: {np.std(scores):.2f} μg/mL")
126
+
127
+ # Show top candidates
128
+ top_candidates = sorted(results, key=lambda x: x['apex_score'])[:10]  # lower MIC = better activity
129
+
130
+ print(f"\n=== Top 10 Candidates ===")
131
+ for i, candidate in enumerate(top_candidates):
132
+ print(f"{i+1:2d}. MIC: {candidate['apex_score']:.2f} μg/mL | "
133
+ f"Length: {candidate['length']:2d} | "
134
+ f"Sequence: {candidate['sequence']}")
135
+
136
+ return results
137
+
138
+ def save_results(self, results, filename='generated_peptides_results.json'):
139
+ """
140
+ Save results to JSON file.
141
+ """
142
+ if not results:
143
+ print("No results to save")
144
+ return
145
+
146
+ output = {
147
+ 'metadata': {
148
+ 'model_path': self.model_path,
149
+ 'num_sequences': len(results),
150
+ 'generation_timestamp': str(torch.cuda.Event() if torch.cuda.is_available() else 'cpu'),
151
+ 'apex_available': APEX_AVAILABLE
152
+ },
153
+ 'results': results
154
+ }
155
+
156
+ with open(filename, 'w') as f:
157
+ json.dump(output, f, indent=2)
158
+
159
+ print(f"✓ Results saved to {filename}")
160
+
161
+ def run_full_pipeline(self, num_samples=100, save_results=True):
162
+ """
163
+ Run the complete pipeline: generate peptides and test with APEX.
164
+ """
165
+ print("🚀 Starting Full Peptide Generation and Testing Pipeline")
166
+ print("=" * 60)
167
+
168
+ # Step 1: Generate peptides
169
+ sequences = self.generate_peptides(num_samples=num_samples)
170
+
171
+ # Step 2: Test with APEX
172
+ results = self.test_with_apex(sequences)
173
+
174
+ # Step 3: Analyze results
175
+ if results:
176
+ self.analyze_results(results)
177
+
178
+ # Step 4: Save results
179
+ if save_results:
180
+ self.save_results(results)
181
+
182
+ return results
183
+
184
+ def main():
185
+ """
186
+ Main function to test existing decoded sequence files with APEX.
187
+ """
188
+ print("🧬 AMP Flow Model - Testing Decoded Sequences with APEX")
189
+ print("=" * 60)
190
+
191
+ # Check if APEX is available
192
+ if not APEX_AVAILABLE:
193
+ print("❌ Local APEX not available - cannot test sequences")
194
+ print("Please ensure local_apex_wrapper.py is properly set up")
195
+ return
196
+
197
+ # Initialize tester (we only need APEX, not the generator)
198
+ print("Initializing APEX predictor...")
199
+ apex = LocalAPEXWrapper()
200
+ print("✓ Local APEX loaded successfully!")
201
+
202
+ # Get today's date for filename
203
+ today = datetime.now().strftime('%Y%m%d')
204
+
205
+ # Define the decoded sequence files to test (using today's generated sequences)
206
+ cfg_files = {
207
+ 'No CFG (0.0)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_no_cfg_00_{today}.txt',
208
+ 'Weak CFG (3.0)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_weak_cfg_30_{today}.txt',
209
+ 'Strong CFG (7.5)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_strong_cfg_75_{today}.txt',
210
+ 'Very Strong CFG (15.0)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_very_strong_cfg_150_{today}.txt'
211
+ }
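+ # The CFG guidance scale is encoded in each filename without the decimal point
+ # (e.g. 3.0 -> "30", 7.5 -> "75", 15.0 -> "150").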
212
+
213
+ all_results = {}
214
+
215
+ for cfg_name, file_path in cfg_files.items():
216
+ print(f"\n{'='*60}")
217
+ print(f"Testing {cfg_name} sequences...")
218
+ print(f"Loading: {file_path}")
219
+
220
+ if not os.path.exists(file_path):
221
+ print(f"❌ File not found: {file_path}")
222
+ continue
223
+
224
+ # Read sequences from file
225
+ sequences = []
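+ # Decoded-sequence files are parsed as tab-separated lines; the peptide
+ # sequence is read from the second field, and lines starting with '#' are skipped.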
226
+ with open(file_path, 'r') as f:
227
+ for line in f:
228
+ line = line.strip()
229
+ if line and not line.startswith('#') and '\t' in line:
230
+ # Parse sequence from tab-separated format
231
+ parts = line.split('\t')
232
+ if len(parts) >= 2:
233
+ seq = parts[1].strip()
234
+ if seq:
235
+ sequences.append(seq)
236
+
237
+ print(f"✓ Loaded {len(sequences)} sequences from {file_path}")
238
+
239
+ # Test sequences with APEX
240
+ results = []
241
+ print(f"Testing {len(sequences)} sequences with APEX...")
242
+
243
+ for i, seq in enumerate(tqdm(sequences, desc=f"Testing {cfg_name}")):
244
+ try:
245
+ # Predict antimicrobial activity using local APEX
246
+ avg_mic = apex.predict_single(seq)
247
+ is_amp = apex.is_amp(seq, threshold=32.0)  # classify as AMP at a 32 μg/mL MIC threshold
248
+
249
+ result = {
250
+ 'sequence': seq,
251
+ 'sequence_id': f'{cfg_name.lower().replace(" ", "_").replace("(", "").replace(")", "").replace(".", "")}_{i:03d}',
252
+ 'cfg_setting': cfg_name,
253
+ 'apex_score': avg_mic, # Lower MIC = better activity
254
+ 'is_amp': is_amp,
255
+ 'length': len(seq)
256
+ }
257
+ results.append(result)
258
+
259
+ except Exception as e:
260
+ print(f"Warning: Error testing sequence {i}: {e}")
261
+ continue
262
+
263
+ # Analyze results for this CFG setting
264
+ if results:
265
+ print(f"\n=== Analysis of {cfg_name} ===")
266
+ scores = [r['apex_score'] for r in results]
267
+ amp_count = sum(1 for r in results if r['is_amp'])
268
+
269
+ print(f"Total sequences tested: {len(results)}")
270
+ print(f"Predicted AMPs: {amp_count} ({amp_count/len(results)*100:.1f}%)")
271
+ print(f"Average MIC: {np.mean(scores):.2f} μg/mL")
272
+ print(f"MIC range: {np.min(scores):.2f} - {np.max(scores):.2f} μg/mL")
273
+ print(f"MIC std: {np.std(scores):.2f} μg/mL")
274
+
275
+ # Show top 5 candidates for this CFG setting
276
+ top_candidates = sorted(results, key=lambda x: x['apex_score'])[:5] # Lower MIC is better
277
+
278
+ print(f"\n=== Top 5 Candidates ({cfg_name}) ===")
279
+ for i, candidate in enumerate(top_candidates):
280
+ print(f"{i+1:2d}. MIC: {candidate['apex_score']:.2f} μg/mL | "
281
+ f"Length: {candidate['length']:2d} | "
282
+ f"Sequence: {candidate['sequence']}")
283
+
284
+ all_results[cfg_name] = results
285
+
286
+ # Create output directory if it doesn't exist
287
+ output_dir = '/data2/edwardsun/apex_results'
288
+ os.makedirs(output_dir, exist_ok=True)
289
+
290
+ # Save individual results with date
291
+ output_file = os.path.join(output_dir, f"apex_results_{cfg_name.lower().replace(' ', '_').replace('(', '').replace(')', '').replace('.', '')}_{today}.json")
292
+ with open(output_file, 'w') as f:
293
+ json.dump({
294
+ 'metadata': {
295
+ 'cfg_setting': cfg_name,
296
+ 'num_sequences': len(results),
297
+ 'apex_available': APEX_AVAILABLE
298
+ },
299
+ 'results': results
300
+ }, f, indent=2)
301
+ print(f"✓ Results saved to {output_file}")
302
+
303
+ # Overall comparison
304
+ print(f"\n{'='*60}")
305
+ print("OVERALL COMPARISON ACROSS CFG SETTINGS")
306
+ print(f"{'='*60}")
307
+
308
+ for cfg_name, results in all_results.items():
309
+ if results:
310
+ scores = [r['apex_score'] for r in results]
311
+ amp_count = sum(1 for r in results if r['is_amp'])
312
+ print(f"\n{cfg_name}:")
313
+ print(f" Total: {len(results)} | AMPs: {amp_count} ({amp_count/len(results)*100:.1f}%)")
314
+ print(f" Avg MIC: {np.mean(scores):.2f} μg/mL | Best MIC: {np.min(scores):.2f} μg/mL")
315
+
316
+ # Find best overall candidates
317
+ all_candidates = []
318
+ for cfg_name, results in all_results.items():
319
+ all_candidates.extend(results)
320
+
321
+ if all_candidates:
322
+ print(f"\n{'='*60}")
323
+ print("TOP 10 OVERALL CANDIDATES (All CFG Settings)")
324
+ print(f"{'='*60}")
325
+
326
+ top_overall = sorted(all_candidates, key=lambda x: x['apex_score'])[:10]
327
+ for i, candidate in enumerate(top_overall):
328
+ print(f"{i+1:2d}. MIC: {candidate['apex_score']:.2f} μg/mL | "
329
+ f"CFG: {candidate['cfg_setting']} | "
330
+ f"Sequence: {candidate['sequence']}")
331
+
332
+ # Create output directory if it doesn't exist
333
+ output_dir = '/data2/edwardsun/apex_results'
334
+ os.makedirs(output_dir, exist_ok=True)
335
+
336
+ # Save overall results with date
337
+ overall_results_file = os.path.join(output_dir, f'apex_results_all_cfg_comparison_{today}.json')
338
+ with open(overall_results_file, 'w') as f:
339
+ json.dump({
340
+ 'metadata': {
341
+ 'date': today,
342
+ 'total_sequences': len(all_candidates),
343
+ 'apex_available': APEX_AVAILABLE,
344
+ 'cfg_settings_tested': list(all_results.keys())
345
+ },
346
+ 'results': all_candidates
347
+ }, f, indent=2)
348
+ print(f"\n✓ Overall results saved to {overall_results_file}")
349
+
350
+ # Save comprehensive MIC summary
351
+ mic_summary_file = os.path.join(output_dir, f'mic_summary_{today}.json')
352
+ mic_summary = {
353
+ 'date': today,
354
+ 'summary_by_cfg': {},
355
+ 'all_mics': [r['apex_score'] for r in all_candidates],
356
+ 'amp_count': sum(1 for r in all_candidates if r['is_amp']),
357
+ 'total_sequences': len(all_candidates)
358
+ }
359
+
360
+ for cfg_name, results in all_results.items():
361
+ if results:
362
+ scores = [r['apex_score'] for r in results]
363
+ amp_count = sum(1 for r in results if r['is_amp'])
364
+ mic_summary['summary_by_cfg'][cfg_name] = {
365
+ 'num_sequences': len(results),
366
+ 'amp_count': amp_count,
367
+ 'amp_percentage': amp_count/len(results)*100,
368
+ 'avg_mic': np.mean(scores),
369
+ 'min_mic': np.min(scores),
370
+ 'max_mic': np.max(scores),
371
+ 'std_mic': np.std(scores),
372
+ 'all_mics': scores
373
+ }
374
+
375
+ with open(mic_summary_file, 'w') as f:
376
+ json.dump(mic_summary, f, indent=2)
377
+ print(f"✓ MIC summary saved to {mic_summary_file}")
378
+
379
+ print(f"\n✅ APEX testing completed successfully!")
380
+ print(f"Tested {len(all_candidates)} total sequences across all CFG settings")
381
+
382
+ if __name__ == "__main__":
383
+ main()
usage_example.py ADDED
@@ -0,0 +1,60 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ FlowAMP Usage Example
4
+ This script demonstrates how to use the FlowAMP model for AMP generation.
5
+ Note: This is a demonstration version. For full functionality, you'll need to train the model.
6
+ """
7
+
8
+ import torch
9
+ from final_flow_model import AMPFlowMatcherCFGConcat
10
+
11
+ def main():
12
+ print("=== FlowAMP Usage Example ===")
13
+ print("This demonstrates the model architecture and usage.")
14
+
15
+ if torch.cuda.is_available():
16
+ device = torch.device("cuda")
17
+ print("Using CUDA")
18
+ else:
19
+ device = torch.device("cpu")
20
+ print("Using CPU")
21
+
22
+ # Initialize model
23
+ model = AMPFlowMatcherCFGConcat(
24
+ hidden_dim=480,
25
+ compressed_dim=80,
26
+ n_layers=4,
27
+ n_heads=8,
28
+ dim_ff=1920,
29
+ dropout=0.1,
30
+ max_seq_len=25,
31
+ use_cfg=True
32
+ ).to(device)
33
+
34
+ print("Model initialized successfully!")
35
+ print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
36
+
37
+ # Demonstrate model forward pass
38
+ batch_size = 2
39
+ seq_len = 25
40
+ compressed_dim = 80
41
+
42
+ # Create dummy input
43
+ x = torch.randn(batch_size, seq_len, compressed_dim).to(device)
44
+ time_steps = torch.rand(batch_size, 1).to(device)
45
+
46
+ # Forward pass
47
+ with torch.no_grad():
48
+ output = model(x, time_steps)
49
+
50
+ print(f"Input shape: {x.shape}")
51
+ print(f"Output shape: {output.shape}")
52
+ print("✓ Model forward pass successful!")
53
+
54
+ print("\nTo use this model for AMP generation:")
55
+ print("1. Train the model using the provided training scripts")
56
+ print("2. Use generate_amps.py for peptide generation")
57
+ print("3. Use test_generated_peptides.py for evaluation")
58
+
59
+ if __name__ == "__main__":
60
+ main()