Commit 370f342 by esunAI · 0 Parent(s)

Initial FlowAMP upload: Complete project with all essential files
MODEL_FILES_INFO.md ADDED
@@ -0,0 +1,31 @@
# Model Files Information

## Available Files
- normalization_stats.pt: Preprocessing statistics for ESM-2 embeddings

## Missing Files (Too Large for Hugging Face)
The following model files exceed the 100 MB upload limit and are therefore not included:

### Large Model Files (Not Included)
- flowamp_demo_checkpoint.pth (~1.5 GB): Complete model checkpoint
- compressor_demo.pth (~315 MB): Compressor weights
- decompressor_demo.pth (~158 MB): Decompressor weights
- flow_model_demo.pth (~54 MB): Flow model weights
- apex/trained_models/* (~1 GB total): Pre-trained Apex models

### How to Get Model Files
1. **Train your own**: Use the provided training scripts to train the model
2. **Contact the author**: Request the model files directly from the author
3. **Alternative storage**: Model files may be available on other platforms

### Training Instructions
1. Run the training scripts to generate your own model checkpoints
2. Use amp_flow_training_single_gpu_full_data.py for single-GPU training
3. Use amp_flow_training_multi_gpu.py for multi-GPU training
4. Models are saved automatically during training (see the checkpoint-loading sketch at the end of this file)

### Quick Start
1. Install dependencies: pip install -r requirements.txt
2. Run usage_example.py to verify the installation
3. Train the model using the provided scripts
4. Use generate_amps.py for AMP generation
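
### Loading a Checkpoint (Sketch)

Once you have trained or obtained a checkpoint, it can be inspected as shown below. This is a minimal sketch that assumes the file follows the dictionary format written by the provided training scripts; the exact keys inside `flowamp_demo_checkpoint.pth` may differ.

```python
import torch

# Load on CPU first; move weights to GPU after the model is constructed.
ckpt = torch.load("flowamp_demo_checkpoint.pth", map_location="cpu")

flow_state = ckpt["flow_model_state_dict"]   # flow model weights (assumed key)
stats = ckpt.get("stats")                    # normalization statistics, if stored
print("step:", ckpt.get("step"), "loss:", ckpt.get("loss"))
```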
README.md ADDED
@@ -0,0 +1,159 @@
# FlowAMP: Flow-based Antimicrobial Peptide Generation

## Overview

FlowAMP is a flow-based generative model for designing antimicrobial peptides (AMPs) using conditional flow matching and ESM-2 protein language model embeddings. The project implements a state-of-the-art approach to de novo AMP design with improved generation quality and diversity.

## Key Features

- **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
- **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
- **CFG Training**: Implements classifier-free guidance (CFG) for controllable generation
- **Multi-GPU Training**: Optimized for H100 GPUs with mixed-precision training
- **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment

## Project Structure

```
flow/
├── final_flow_model.py                        # Main FlowAMP model architecture
├── final_sequence_encoder.py                  # ESM-2 sequence encoding
├── final_sequence_decoder.py                  # Sequence decoding and generation
├── compressor_with_embeddings.py              # Embedding compression/decompression
├── cfg_dataset.py                             # CFG dataset and dataloader
├── amp_flow_training_single_gpu_full_data.py  # Single-GPU training
├── amp_flow_training_multi_gpu.py             # Multi-GPU training
├── generate_amps.py                           # AMP generation script
├── test_generated_peptides.py                 # Evaluation and testing
├── apex/                                      # Apex model integration
│   ├── trained_models/                        # Pre-trained Apex models
│   └── AMP_DL_model_twohead.py                # Apex model architecture
├── normalization_stats.pt                     # Preprocessing statistics
└── requirements.yaml                          # Dependencies
```

## Model Architecture

The FlowAMP model consists of:

1. **ESM-2 Encoder**: Extracts protein sequence embeddings using ESM-2
2. **Compressor/Decompressor**: Reduces embedding dimensionality for efficiency
3. **Flow Matcher**: Conditional flow matching in the compressed latent space
4. **CFG Integration**: Classifier-free guidance for controllable generation (sketched below)

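A minimal sketch of how these components fit together at generation time is shown below. It mirrors how the compressor, decompressor, and flow model are called in the training scripts, but the function itself, its defaults (25 latent positions, compressed dimension 80, 25 Euler steps), and the time convention (t = 0 is data, t = 1 is noise, so each step subtracts the predicted velocity) are illustrative assumptions; `generate_amps.py` defines the project's actual sampler.

```python
import torch

@torch.no_grad()
def sample_compressed_latents(flow_model, decompressor, num_samples,
                              seq_len=25, comp_dim=80, steps=25,
                              label=0, device="cuda"):
    """Euler-integrate the learned vector field from noise (t=1) back to data (t=0)."""
    xt = torch.randn(num_samples, seq_len, comp_dim, device=device)   # start from pure noise
    labels = torch.full((num_samples,), label, device=device)         # 0 = AMP class
    dt = 1.0 / steps
    for step in range(steps):
        t = torch.full((num_samples,), 1.0 - step / steps, device=device)
        vt = flow_model(xt, t, labels=labels)   # predicted velocity d x_t / d t
        xt = xt - vt * dt                       # move toward the data end (t = 0)
    return decompressor(xt)                     # back to ESM-2 embedding space

# Overall data flow (illustrative):
#   training:   peptide sequence --ESM-2 encoder--> embedding --Compressor--> latent
#   generation: noise --flow model (above)--> latent --Decompressor--> embedding
#               --sequence decoder--> peptide sequence
```
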
## Training

### Single GPU Training
```bash
python amp_flow_training_single_gpu_full_data.py
```

### Multi-GPU Training
```bash
bash launch_multi_gpu_training.sh
```

### Key Training Parameters
- **Batch Size**: 96 (optimized for H100)
- **Learning Rate**: 4e-4 with linear warmup and cosine annealing (see the sketch below)
- **Epochs**: 6000
- **Mixed Precision**: BF16 for H100 optimization
- **CFG Dropout**: 15% of batches trained unconditionally

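The warmup and cosine schedule above is built from PyTorch's standard schedulers, as in the training scripts. The sketch below shows the same composition on a stand-in model; the total step count is a placeholder, not the project's exact value.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(80, 80)                 # stand-in for the flow model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4,
                              betas=(0.9, 0.98), weight_decay=0.01, eps=1e-6)

warmup_steps, total_steps = 5000, 100_000       # total_steps is assumed for illustration
warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=2e-4)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for step in range(total_steps):
    optimizer.step()     # (forward/backward omitted in this sketch)
    scheduler.step()     # one scheduler step per optimizer step, not per epoch
```
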
## Generation

Generate AMPs with different CFG strengths:

```bash
python generate_amps.py --cfg_strength 0.0  # No CFG
python generate_amps.py --cfg_strength 1.0  # Weak CFG
python generate_amps.py --cfg_strength 2.0  # Strong CFG
python generate_amps.py --cfg_strength 3.0  # Very strong CFG
```

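At sampling time, classifier-free guidance blends a conditional and an unconditional velocity prediction at every integration step (for example inside a sampler like the one sketched under Model Architecture). The snippet below shows one common formulation; the unconditional label value (-1, matching the CFG dropout used in the single-GPU training script) and the exact scaling convention of `--cfg_strength` are assumptions — see `generate_amps.py` for the project's definition.

```python
import torch

def guided_velocity(flow_model, xt, t, labels, cfg_strength):
    """Blend conditional and unconditional predictions (one common CFG form)."""
    v_cond = flow_model(xt, t, labels=labels)           # conditional velocity
    if cfg_strength == 0.0:
        return v_cond                                   # plain conditional sampling
    null_labels = torch.full_like(labels, -1)           # "no label" value used at train time
    v_uncond = flow_model(xt, t, labels=null_labels)    # unconditional velocity
    # Push the prediction further in the direction implied by the condition
    return v_uncond + (1.0 + cfg_strength) * (v_cond - v_uncond)
```
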
## Evaluation

### MIC Prediction
The model integrates with Apex for MIC (Minimum Inhibitory Concentration) prediction:

```bash
python test_generated_peptides.py
```

### Performance Metrics
- **Generation Quality**: Evaluated using sequence diversity and validity
- **Antimicrobial Activity**: Predicted using the Apex model integration
- **CFG Effectiveness**: Measured through controlled generation

## Results

### Training Performance
- **Optimized for H100**: 31 steps/second with batch size 96
- **Mixed Precision**: BF16 training for memory efficiency
- **Gradient Clipping**: Stable training with norm = 1.0

### Generation Results
- **Sequence Validity**: High percentage of valid peptide sequences
- **Diversity**: Good sequence diversity across different CFG strengths
- **Antimicrobial Potential**: Predicted MIC values for generated sequences

## Dependencies

Key dependencies include:
- PyTorch 2.0+
- Transformers (for ESM-2)
- Wandb (optional logging)
- Apex (for MIC prediction)

See `requirements.yaml` for the complete dependency list.

## Usage Examples

### Basic AMP Generation
```python
from final_flow_model import AMPFlowMatcherCFGConcat
from generate_amps import generate_amps

# Load trained model
model = AMPFlowMatcherCFGConcat.load_from_checkpoint('path/to/checkpoint.pth')

# Generate AMPs
sequences = generate_amps(model, num_samples=100, cfg_strength=1.0)
```

### Evaluation
```python
from test_generated_peptides import evaluate_generated_peptides

# Evaluate generated sequences
results = evaluate_generated_peptides(sequences)
```

## Research Impact

This work contributes to:
- **Flow-based Protein Design**: Novel application of flow matching to peptide generation
- **Conditional Generation**: CFG integration for controllable AMP design
- **ESM-2 Integration**: Leveraging protein language models for sequence understanding
- **Antimicrobial Discovery**: Automated design of potential therapeutic peptides

## Citation

If you use this code in your research, please cite:

```bibtex
@article{flowamp2024,
  title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
  author={Sun, Edward},
  journal={arXiv preprint},
  year={2024}
}
```

## License

MIT License - see LICENSE file for details.

## Contact

For questions or collaboration, please contact the authors.
UPLOAD_INSTRUCTIONS.txt ADDED
@@ -0,0 +1,48 @@
=== Upload Instructions ===

1. Navigate to the upload directory:
   cd flowamp_upload_small

2. Initialize a git repository and commit the files:
   git init
   git add .
   git commit -m "Initial FlowAMP upload (small version)"

3. Add the Hugging Face remote:
   git remote add origin https://huggingface.co/esunAI/FlowAMP

4. Push to Hugging Face:
   git push -u origin main

+ === Files Included ===
19
+
20
+ Core Model:
21
+ - final_flow_model.py: Main FlowAMP model architecture
22
+ - final_sequence_encoder.py: ESM-2 sequence encoding
23
+ - final_sequence_decoder.py: Sequence decoding and generation
24
+ - compressor_with_embeddings.py: Embedding compression/decompression
25
+ - cfg_dataset.py: CFG dataset and dataloader
26
+
27
+ Training:
28
+ - amp_flow_training_single_gpu_full_data.py: Single GPU training
29
+ - amp_flow_training_multi_gpu.py: Multi-GPU training
30
+ - launch_*.sh: Training launch scripts
31
+
32
+ Models:
33
+ - normalization_stats.pt: Preprocessing statistics
34
+ - MODEL_FILES_INFO.md: Information about missing large model files
35
+
36
+ Apex Integration:
37
+ - apex/AMP_DL_model_twohead.py: Apex model architecture
38
+ - apex/predict.py: MIC prediction script
39
+
40
+ Documentation:
41
+ - README.md: Comprehensive project documentation
42
+ - model_card.md: Hugging Face model card
43
+ - usage_example.py: Usage demonstration
44
+ - requirements.txt: Python dependencies
45
+
46
+ === Note ===
47
+ This is a smaller version without large model files due to Hugging Face size limits.
48
+ See MODEL_FILES_INFO.md for details on obtaining model weights.
amp_flow_training_multi_gpu.py ADDED
@@ -0,0 +1,439 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+ from torch.utils.data import DataLoader
5
+ from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
6
+ from torch.nn.parallel import DistributedDataParallel as DDP
7
+ from torch.utils.data.distributed import DistributedSampler
8
+ import torch.distributed as dist
9
+ import numpy as np
10
+ from tqdm import tqdm
11
+ import json
12
+ import os
13
+ import argparse
14
+
15
+ # Import your existing components
16
+ from compressor_with_embeddings import Compressor, Decompressor, PrecomputedEmbeddingDataset
17
+ from final_flow_model import AMPFlowMatcherCFGConcat, SinusoidalTimeEmbedding
18
+ from cfg_dataset import CFGFlowDataset, create_cfg_dataloader
19
+
20
+ # ---------------- Configuration ----------------
21
+ ESM_DIM = 1280 # ESM-2 hidden dim (esm2_t33_650M_UR50D)
22
+ COMP_RATIO = 16 # compression factor
23
+ COMP_DIM = ESM_DIM // COMP_RATIO
24
+ MAX_SEQ_LEN = 50 # Actual sequence length from final_sequence_encoder.py
25
+ BATCH_SIZE = 64 # Per GPU batch size (256 total across 4 GPUs) - increased for faster training
26
+ EPOCHS = 5000 # Extended to 5K iterations for more comprehensive training (~50 minutes)
27
+ BASE_LR = 1e-4 # initial learning rate
28
+ LR_MIN = 2e-5 # minimum learning rate for cosine schedule
29
+ WARMUP_STEPS = 100 # Reduced warmup for shorter training
30
+
31
+ def setup_distributed():
32
+ """Setup distributed training."""
33
+ if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
34
+ rank = int(os.environ["RANK"])
35
+ world_size = int(os.environ['WORLD_SIZE'])
36
+ local_rank = int(os.environ['LOCAL_RANK'])
37
+ else:
38
+ print('Not using distributed mode')
39
+ return None, None, None
40
+
41
+ torch.cuda.set_device(local_rank)
42
+ dist.init_process_group(backend='nccl', init_method='env://')
43
+ dist.barrier()
44
+
45
+ return rank, world_size, local_rank
46
+
47
+ class AMPFlowTrainerMultiGPU:
48
+ """
49
+ Multi-GPU training pipeline for AMP generation using ProtFlow methodology.
50
+ """
51
+
52
+ def __init__(self, embeddings_path, cfg_data_path, rank, world_size, local_rank):
53
+ self.rank = rank
54
+ self.world_size = world_size
55
+ self.local_rank = local_rank
56
+ self.device = torch.device(f'cuda:{local_rank}')
57
+ self.embeddings_path = embeddings_path
58
+ self.cfg_data_path = cfg_data_path
59
+
60
+ # Load ALL pre-computed embeddings (only on main process)
61
+ if self.rank == 0:
62
+ print(f"Loading ALL AMP embeddings from {embeddings_path}...")
63
+
64
+ # Try to load the combined embeddings file first (FULL DATA)
65
+ combined_path = os.path.join(embeddings_path, "all_peptide_embeddings.pt")
66
+
67
+ if os.path.exists(combined_path):
68
+ print(f"Loading combined embeddings from {combined_path} (FULL DATA)...")
69
+ self.embeddings = torch.load(combined_path, map_location=self.device)
70
+ print(f"✓ Loaded ALL embeddings: {self.embeddings.shape}")
71
+ else:
72
+ print("Combined embeddings file not found, loading individual files...")
73
+ # Fallback to individual files
74
+ import glob
75
+
76
+ embedding_files = glob.glob(os.path.join(embeddings_path, "*.pt"))
77
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json') and not f.endswith('all_peptide_embeddings.pt')]
78
+
79
+ print(f"Found {len(embedding_files)} individual embedding files")
80
+
81
+ # Load and stack all embeddings
82
+ embeddings_list = []
83
+ for file_path in embedding_files:
84
+ try:
85
+ embedding = torch.load(file_path)
86
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
87
+ embeddings_list.append(embedding)
88
+ else:
89
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
90
+ except Exception as e:
91
+ print(f"Warning: Could not load {file_path}: {e}")
92
+
93
+ if not embeddings_list:
94
+ raise ValueError("No valid embeddings found!")
95
+
96
+ self.embeddings = torch.stack(embeddings_list)
97
+ print(f"Loaded {len(self.embeddings)} embeddings from individual files")
98
+
99
+ # Compute normalization statistics
100
+ print("Computing preprocessing statistics...")
101
+ self._compute_preprocessing_stats()
102
+
103
+ # Broadcast statistics to all processes
104
+ if self.rank == 0:
105
+ stats_tensor = torch.stack([
106
+ self.stats['mean'], self.stats['std'],
107
+ self.stats['min'], self.stats['max']
108
+ ]).to(self.device)
109
+ else:
110
+ stats_tensor = torch.zeros(4, ESM_DIM, device=self.device)
111
+
112
+ dist.broadcast(stats_tensor, src=0)
113
+
114
+ if self.rank != 0:
115
+ self.stats = {
116
+ 'mean': stats_tensor[0],
117
+ 'std': stats_tensor[1],
118
+ 'min': stats_tensor[2],
119
+ 'max': stats_tensor[3]
120
+ }
121
+
122
+ # Initialize models
123
+ self._initialize_models()
124
+
125
+ def _compute_preprocessing_stats(self):
126
+ """Compute preprocessing statistics (only on main process)."""
127
+ # Flatten all embeddings
128
+ flat = self.embeddings.view(-1, ESM_DIM)
129
+
130
+ # 1. Z-score normalization statistics
131
+ feat_mean = flat.mean(0)
132
+ feat_std = flat.std(0) + 1e-8
133
+
134
+ # 2. Truncation statistics (after z-score)
135
+ z_score_normalized = (flat - feat_mean) / feat_std
136
+ z_score_clamped = torch.clamp(z_score_normalized, -4, 4)
137
+
138
+ # 3. Min-max normalization statistics (after truncation)
139
+ feat_min = z_score_clamped.min(0)[0]
140
+ feat_max = z_score_clamped.max(0)[0]
141
+
142
+ # Store statistics
143
+ self.stats = {
144
+ 'mean': feat_mean,
145
+ 'std': feat_std,
146
+ 'min': feat_min,
147
+ 'max': feat_max
148
+ }
149
+
150
+ # Save statistics for later use
151
+ torch.save(self.stats, 'normalization_stats.pt')
152
+ if self.rank == 0:
153
+ print("✓ Preprocessing statistics computed and saved to normalization_stats.pt")
154
+
155
+ def _initialize_models(self):
156
+ """Initialize models for distributed training."""
157
+ # Load pre-trained compressor and decompressor
158
+ self.compressor = Compressor().to(self.device)
159
+ self.decompressor = Decompressor().to(self.device)
160
+
161
+ # Load trained weights
162
+ self.compressor.load_state_dict(torch.load('final_compressor_model.pth', map_location=self.device))
163
+ self.decompressor.load_state_dict(torch.load('final_decompressor_model.pth', map_location=self.device))
164
+
165
+ # Initialize flow matching model with CFG
166
+ self.flow_model = AMPFlowMatcherCFGConcat(
167
+ hidden_dim=480,
168
+ compressed_dim=COMP_DIM,
169
+ n_layers=12,
170
+ n_heads=16,
171
+ dim_ff=3072,
172
+ max_seq_len=25,
173
+ use_cfg=True
174
+ ).to(self.device)
175
+
176
+ # Wrap with DDP
177
+ self.flow_model = DDP(self.flow_model, device_ids=[self.local_rank], find_unused_parameters=True)
178
+
179
+ if self.rank == 0:
180
+ print("✓ Initialized models for distributed training")
181
+ print(f" - Flow model parameters: {sum(p.numel() for p in self.flow_model.parameters()):,}")
182
+ print(f" - Using {self.world_size} GPUs")
183
+
184
+ def _preprocess_batch(self, batch):
185
+ """Apply preprocessing to a batch of embeddings."""
186
+ # 1. Z-score normalization
187
+ h_norm = (batch - self.stats['mean'].to(batch.device)) / self.stats['std'].to(batch.device)
188
+
189
+ # 2. Truncation (saturation) of outliers
190
+ h_trunc = torch.clamp(h_norm, min=-4.0, max=4.0)
191
+
192
+ # 3. Min-max normalization per dimension
193
+ h_min = self.stats['min'].to(batch.device)
194
+ h_max = self.stats['max'].to(batch.device)
195
+ h_scaled = (h_trunc - h_min) / (h_max - h_min + 1e-8)
196
+ h_scaled = torch.clamp(h_scaled, 0.0, 1.0)
197
+
198
+ return h_scaled
199
+
200
+ def train_flow_matching(self):
201
+ """Train the flow matching model using distributed training."""
202
+ if self.rank == 0:
203
+ print("Step 3: Training Flow Matching model (Multi-GPU)...")
204
+
205
+ # Create CFG dataset and distributed data loader
206
+ try:
207
+ # Try to use CFG dataset with real labels
208
+ dataset = CFGFlowDataset(
209
+ embeddings_path=self.embeddings_path,
210
+ cfg_data_path=self.cfg_data_path,
211
+ use_masked_labels=True,
212
+ max_seq_len=MAX_SEQ_LEN,
213
+ device=self.device
214
+ )
215
+ print("✓ Using CFG dataset with real labels")
216
+ except Exception as e:
217
+ print(f"Warning: Could not load CFG dataset: {e}")
218
+ print("Falling back to random labels (not recommended for CFG)")
219
+ # Fallback to original dataset with random labels
220
+ dataset = PrecomputedEmbeddingDataset(self.embeddings_path)
221
+
222
+ sampler = DistributedSampler(dataset, num_replicas=self.world_size, rank=self.rank)
223
+ dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, sampler=sampler, num_workers=4)
224
+
225
+ # Initialize optimizer
226
+ optimizer = optim.AdamW(
227
+ self.flow_model.parameters(),
228
+ lr=BASE_LR,
229
+ betas=(0.9, 0.98),
230
+ weight_decay=0.01,
231
+ eps=1e-6
232
+ )
233
+
234
+ # LR scheduling: warmup -> cosine
235
+ warmup_sched = LinearLR(optimizer, start_factor=1e-8, end_factor=1.0, total_iters=WARMUP_STEPS)
236
+ cosine_sched = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=LR_MIN)
237
+ scheduler = SequentialLR(optimizer, [warmup_sched, cosine_sched], milestones=[WARMUP_STEPS])
238
+
239
+ # Training loop
240
+ self.flow_model.train()
241
+ total_steps = 0
242
+
243
+ if self.rank == 0:
244
+ print(f"Starting training for {EPOCHS} iterations with FULL DATA...")
245
+ print(f"Total batch size: {BATCH_SIZE * self.world_size}")
246
+ print(f"Steps per epoch: {len(dataloader)}")
247
+ print(f"Total samples: {len(dataset):,}")
248
+ print(f"Estimated time: ~30-45 minutes (using ALL data)")
249
+
250
+ for epoch in range(EPOCHS):
251
+ sampler.set_epoch(epoch) # Ensure different shuffling per epoch
252
+
253
+ for batch_idx, batch_data in enumerate(dataloader):
254
+ # Handle different data formats
255
+ if isinstance(batch_data, dict) and 'embeddings' in batch_data:
256
+ # CFG dataset format
257
+ x = batch_data['embeddings'].to(self.device)
258
+ labels = batch_data['labels'].to(self.device)
259
+ else:
260
+ # Original dataset format - use random labels
261
+ x = batch_data.to(self.device)
262
+ labels = torch.randint(0, 3, (x.shape[0],), device=self.device)
263
+
264
+ batch_size = x.shape[0]
265
+
266
+ # Apply preprocessing
267
+ x_processed = self._preprocess_batch(x)
268
+
269
+ # Compress to latent space
270
+ with torch.no_grad():
271
+ z = self.compressor(x_processed, self.stats)
272
+
273
+ # Sample random noise
274
+ eps = torch.randn_like(z)
275
+
276
+ # Sample random time
277
+ t = torch.rand(batch_size, device=self.device)
278
+
279
+ # Interpolate between data and noise
280
+ xt = t.view(batch_size, 1, 1) * eps + (1 - t.view(batch_size, 1, 1)) * z
281
+
282
+ # Target vector field for rectified flow
283
+ ut = eps - z
284
+
285
+ # Use real labels from CFG dataset or random labels as fallback
286
+ # labels are already defined above based on dataset type
287
+
288
+ # Predict vector field with CFG
289
+ vt_pred = self.flow_model(xt, t, labels=labels)
290
+
291
+ # CFM loss
292
+ loss = ((vt_pred - ut) ** 2).mean()
293
+
294
+ # Backward pass
295
+ optimizer.zero_grad()
296
+ loss.backward()
297
+
298
+ # Gradient clipping
299
+ torch.nn.utils.clip_grad_norm_(self.flow_model.parameters(), 1.0)
300
+
301
+ optimizer.step()
302
+ scheduler.step()
303
+
304
+ total_steps += 1
305
+
306
+ # Logging (only on main process) - more frequent for short training
307
+ if self.rank == 0 and total_steps % 10 == 0:
308
+ progress = (total_steps / EPOCHS) * 100
309
+ label_dist = torch.bincount(labels, minlength=3)
310
+ print(f"Step {total_steps}/{EPOCHS} ({progress:.1f}%): Loss = {loss.item():.6f}, LR = {scheduler.get_last_lr()[0]:.2e}, Labels: AMP={label_dist[0]}, Non-AMP={label_dist[1]}, Mask={label_dist[2]}")
311
+
312
+ # Save checkpoint (only on main process) - more frequent for short training
313
+ if self.rank == 0 and total_steps % 100 == 0:
314
+ self._save_checkpoint(total_steps, loss.item())
315
+
316
+ # Validation (only on main process) - more frequent for short training
317
+ if self.rank == 0 and total_steps % 200 == 0:
318
+ self._validate()
319
+
320
+ # Save final model (only on main process)
321
+ if self.rank == 0:
322
+ self._save_checkpoint(total_steps, loss.item(), is_final=True)
323
+ print("✓ Flow matching training completed!")
324
+
325
+ def _save_checkpoint(self, step, loss, is_final=False):
326
+ """Save training checkpoint (only on main process)."""
327
+ # Get the underlying model from DDP
328
+ model_state_dict = self.flow_model.module.state_dict()
329
+
330
+ checkpoint = {
331
+ 'step': step,
332
+ 'flow_model_state_dict': model_state_dict,
333
+ 'loss': loss,
334
+ }
335
+
336
+ if is_final:
337
+ torch.save(checkpoint, 'amp_flow_model_final_full_data.pth')
338
+ print(f"✓ Final model saved: amp_flow_model_final_full_data.pth")
339
+ else:
340
+ torch.save(checkpoint, f'amp_flow_checkpoint_full_data_step_{step}.pth')
341
+ print(f"✓ Checkpoint saved: amp_flow_checkpoint_full_data_step_{step}.pth")
342
+
343
+ def _validate(self):
344
+ """Validate the model by generating a few samples."""
345
+ print("Generating validation samples...")
346
+ self.flow_model.eval()
347
+
348
+ with torch.no_grad():
349
+ # Generate a few samples
350
+ eps = torch.randn(4, 25, COMP_DIM, device=self.device)
351
+ xt = eps.clone()
352
+
353
+ # 25-step generation with CFG (using AMP label)
354
+ labels = torch.full((4,), 0, device=self.device) # 0 = AMP
355
+ for step in range(25):
356
+ t = torch.ones(4, device=self.device) * (1.0 - step/25)
357
+ vt = self.flow_model(xt, t, labels=labels)
358
+ dt = 1.0 / 25
359
+ xt = xt + vt * dt
360
+
361
+ # Decompress
362
+ decompressed = self.decompressor(xt)
363
+
364
+ # Apply reverse preprocessing
365
+ m, s, mn, mx = self.stats['mean'].to(self.device), self.stats['std'].to(self.device), self.stats['min'].to(self.device), self.stats['max'].to(self.device)
366
+ decompressed = decompressed * (mx - mn + 1e-8) + mn
367
+ decompressed = decompressed * s + m
368
+
369
+ print(f" Generated samples shape: {decompressed.shape}")
370
+ print(f" Sample stats - Mean: {decompressed.mean():.4f}, Std: {decompressed.std():.4f}")
371
+
372
+ self.flow_model.train()
373
+
374
+ def main():
375
+ """Main training function with distributed setup."""
376
+ parser = argparse.ArgumentParser()
377
+ parser.add_argument('--local_rank', type=int, default=0)
378
+ parser.add_argument('--cfg_data_path', type=str, default='/data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json',
379
+ help='Path to FULL CFG training data with real labels')
380
+ args = parser.parse_args()
381
+
382
+ # Setup distributed training
383
+ rank, world_size, local_rank = setup_distributed()
384
+
385
+ if rank == 0:
386
+ print("=== Multi-GPU AMP Flow Matching Training Pipeline with FULL DATA ===")
387
+ print("This implements the complete ProtFlow methodology for AMP generation.")
388
+ print("Training for 5,000 iterations (~30-45 minutes) using ALL available data.")
389
+ print()
390
+
391
+ # Check if required files exist
392
+ required_files = [
393
+ 'final_compressor_model.pth',
394
+ 'final_decompressor_model.pth',
395
+ '/data2/edwardsun/flow_project/peptide_embeddings/'
396
+ ]
397
+
398
+ for file in required_files:
399
+ if not os.path.exists(file):
400
+ print(f"❌ Missing required file: {file}")
401
+ print("Please ensure you have:")
402
+ print("1. Run final_sequence_encoder.py to generate embeddings")
403
+ print("2. Run compressor_with_embeddings.py to train compressor/decompressor")
404
+ return
405
+
406
+ # Check if CFG data exists
407
+ if not os.path.exists(args.cfg_data_path):
408
+ print(f"⚠️ CFG data not found: {args.cfg_data_path}")
409
+ print("Training will use random labels (not recommended for CFG)")
410
+ print("To use real labels, run uniprot_data_processor.py first")
411
+ else:
412
+ print(f"✓ CFG data found: {args.cfg_data_path}")
413
+
414
+ print("✓ All required files found!")
415
+ print()
416
+
417
+ # Initialize trainer
418
+ trainer = AMPFlowTrainerMultiGPU(
419
+ embeddings_path='/data2/edwardsun/flow_project/peptide_embeddings/',
420
+ cfg_data_path=args.cfg_data_path,
421
+ rank=rank,
422
+ world_size=world_size,
423
+ local_rank=local_rank
424
+ )
425
+
426
+ # Train flow matching model
427
+ trainer.train_flow_matching()
428
+
429
+ if rank == 0:
430
+ print("\n=== Multi-GPU Training Complete with FULL DATA ===")
431
+ print("Your AMP flow matching model trained on ALL available data!")
432
+ print("Next steps:")
433
+ print("1. Test the model: python generate_amps.py")
434
+ print("2. Compare performance with previous model")
435
+ print("3. Implement reflow for 1-step generation")
436
+ print("4. Add conditioning for toxicity (future project)")
437
+
438
+ if __name__ == "__main__":
439
+ main()
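To make the core objective in the loop above easier to see in isolation, here is the same conditional flow matching step distilled into a standalone function (a sketch for illustration, not part of the repository):

```python
import torch

def cfm_loss(flow_model, z, labels):
    """One rectified-flow / CFM step. z: clean compressed latents (B, L, D); labels: (B,)."""
    B = z.shape[0]
    eps = torch.randn_like(z)                 # noise endpoint of the path
    t = torch.rand(B, device=z.device)        # uniform time, t=0 -> data, t=1 -> noise
    tb = t.view(B, 1, 1)
    xt = tb * eps + (1.0 - tb) * z            # linear interpolation between data and noise
    ut = eps - z                              # constant target velocity along the path
    vt = flow_model(xt, t, labels=labels)     # model's predicted velocity
    return ((vt - ut) ** 2).mean()            # mean-squared flow matching loss
```

The script reads RANK, WORLD_SIZE, and LOCAL_RANK from the environment, so it is normally launched through a distributed launcher such as torchrun (e.g., `torchrun --nproc_per_node=4 amp_flow_training_multi_gpu.py`), presumably wrapped by launch_multi_gpu_training.sh.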
amp_flow_training_single_gpu_full_data.py ADDED
@@ -0,0 +1,561 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import torch.optim as optim
5
+ from torch.utils.data import DataLoader
6
+ from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
7
+ import numpy as np
8
+ from tqdm import tqdm
9
+ import json
10
+ import os
11
+ import argparse
12
+ import time
13
+ from torch.cuda.amp import autocast, GradScaler
14
+ import wandb # For logging (optional)
15
+
16
+ # Import your existing components
17
+ from compressor_with_embeddings import Compressor, Decompressor, PrecomputedEmbeddingDataset
18
+ from final_flow_model import AMPFlowMatcherCFGConcat, SinusoidalTimeEmbedding
19
+ from cfg_dataset import CFGFlowDataset, create_cfg_dataloader
20
+
21
+ # ---------------- Optimized Configuration for H100 ----------------
22
+ ESM_DIM = 1280 # ESM-2 hidden dim (esm2_t33_650M_UR50D)
23
+ COMP_RATIO = 16 # compression factor
24
+ COMP_DIM = ESM_DIM // COMP_RATIO
25
+ MAX_SEQ_LEN = 50 # Actual sequence length from final_sequence_encoder.py
26
+
27
+ # Optimized hyperparameters for H100 overnight training
28
+ BATCH_SIZE = 96 # Optimized based on profiling (fastest speed: 31 steps/s)
29
+ EPOCHS = 6000 # Adjusted for 8-10 hours with batch size 96 (31 steps/s)
30
+ BASE_LR = 4e-4 # Increased from 1e-4 (scaled with batch size)
31
+ LR_MIN = 2e-4 # Minimum learning rate for cosine schedule
32
+ WARMUP_STEPS = 5000 # 5% of total iterations for warmup
33
+ GPU_ID = 3 # Use GPU 3 (the idle one)
34
+
35
+ # Training optimizations
36
+ USE_MIXED_PRECISION = True # BF16 for H100
37
+ GRADIENT_CLIP_NORM = 1.0 # Gradient clipping for stability
38
+ WEIGHT_DECAY = 0.01 # Weight decay for regularization
39
+ VALIDATION_INTERVAL = 10000 # Validate every 10K steps
40
+ CHECKPOINT_INTERVAL = 1000 # Save checkpoint every 1000 epochs
41
+ NUM_WORKERS = 16 # Increased data loading workers
42
+
43
+ # CFG training parameters
44
+ CFG_DROPOUT_RATE = 0.15 # 15% of batches as unconditional for CFG
45
+
46
+ class AMPFlowTrainerSingleGPUFullData:
47
+ """
48
+ Optimized Single GPU training pipeline for AMP generation using ProtFlow methodology.
49
+ Uses ALL available data with H100-optimized settings for overnight training.
50
+ """
51
+
52
+ def __init__(self, embeddings_path, cfg_data_path, use_wandb=False):
53
+ self.device = torch.device(f'cuda:{GPU_ID}')
54
+ self.embeddings_path = embeddings_path
55
+ self.cfg_data_path = cfg_data_path
56
+ self.use_wandb = use_wandb
57
+
58
+ # Enable H100 optimizations
59
+ torch.backends.cuda.matmul.allow_tf32 = True
60
+ torch.backends.cudnn.allow_tf32 = True
61
+
62
+ print(f"Using GPU {GPU_ID} for optimized H100 training")
63
+ print(f"Mixed precision: {USE_MIXED_PRECISION}")
64
+ print(f"Batch size: {BATCH_SIZE}")
65
+ print(f"Target epochs: {EPOCHS}")
66
+ print(f"Learning rate: {BASE_LR} -> {LR_MIN}")
67
+
68
+ # Initialize mixed precision training
69
+ if USE_MIXED_PRECISION:
70
+ self.scaler = GradScaler()
71
+ print("✓ Mixed precision training enabled (BF16)")
72
+
73
+ # Initialize wandb if requested
74
+ if self.use_wandb:
75
+ wandb.init(
76
+ project="amp-flow-training",
77
+ config={
78
+ "batch_size": BATCH_SIZE,
79
+ "epochs": EPOCHS,
80
+ "base_lr": BASE_LR,
81
+ "lr_min": LR_MIN,
82
+ "warmup_steps": WARMUP_STEPS,
83
+ "mixed_precision": USE_MIXED_PRECISION,
84
+ "gradient_clip": GRADIENT_CLIP_NORM,
85
+ "weight_decay": WEIGHT_DECAY
86
+ }
87
+ )
88
+
89
+ print(f"Loading ALL AMP embeddings from {embeddings_path}...")
90
+
91
+ # Load ALL embeddings (use the combined file instead of individual files)
92
+ self._load_all_embeddings()
93
+
94
+ # Compute normalization statistics
95
+ print("Computing preprocessing statistics...")
96
+ self._compute_preprocessing_stats()
97
+
98
+ # Initialize models
99
+ self._initialize_models()
100
+
101
+ # Initialize datasets and dataloaders
102
+ self._initialize_data()
103
+
104
+ # Initialize optimizer and scheduler
105
+ self._initialize_optimizer()
106
+
107
+ print("✓ Optimized Single GPU training setup complete with FULL DATA!")
108
+
109
+ def _load_all_embeddings(self):
110
+ """Load ALL peptide embeddings from the combined file."""
111
+ # Try to load the combined embeddings file first
112
+ combined_path = os.path.join(self.embeddings_path, "all_peptide_embeddings.pt")
113
+
114
+ if os.path.exists(combined_path):
115
+ print(f"Loading combined embeddings from {combined_path}...")
116
+ self.embeddings = torch.load(combined_path, map_location=self.device)
117
+ print(f"✓ Loaded ALL embeddings: {self.embeddings.shape}")
118
+ else:
119
+ print("Combined embeddings file not found, loading individual files...")
120
+ # Fallback to individual files
121
+ import glob
122
+
123
+ embedding_files = glob.glob(os.path.join(self.embeddings_path, "*.pt"))
124
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json') and not f.endswith('all_peptide_embeddings.pt')]
125
+
126
+ print(f"Found {len(embedding_files)} individual embedding files")
127
+
128
+ # Load and stack all embeddings
129
+ embeddings_list = []
130
+ for file_path in embedding_files:
131
+ try:
132
+ embedding = torch.load(file_path)
133
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
134
+ embeddings_list.append(embedding)
135
+ else:
136
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
137
+ except Exception as e:
138
+ print(f"Warning: Could not load {file_path}: {e}")
139
+
140
+ if not embeddings_list:
141
+ raise ValueError("No valid embeddings found!")
142
+
143
+ self.embeddings = torch.stack(embeddings_list)
144
+ print(f"Loaded {len(self.embeddings)} embeddings from individual files")
145
+
146
+ def _compute_preprocessing_stats(self):
147
+ """Compute normalization statistics for embeddings."""
148
+ # Flatten all embeddings
149
+ flat_embeddings = self.embeddings.reshape(-1, ESM_DIM)
150
+
151
+ # Compute statistics
152
+ mean = flat_embeddings.mean(dim=0)
153
+ std = flat_embeddings.std(dim=0)
154
+ min_val = flat_embeddings.min()
155
+ max_val = flat_embeddings.max()
156
+
157
+ self.stats = {
158
+ 'mean': mean,
159
+ 'std': std,
160
+ 'min': min_val,
161
+ 'max': max_val
162
+ }
163
+
164
+ # Save statistics
165
+ torch.save(self.stats, 'normalization_stats.pt')
166
+ print(f"✓ Statistics computed and saved:")
167
+ print(f" Total embeddings: {len(self.embeddings):,}")
168
+ print(f" Mean: {mean.mean():.4f} ± {mean.std():.4f}")
169
+ print(f" Std: {std.mean():.4f} ± {std.std():.4f}")
170
+ print(f" Range: [{min_val:.4f}, {max_val:.4f}]")
171
+
172
+ def _initialize_models(self):
173
+ """Initialize compressor, decompressor, and flow model."""
174
+ print("Initializing models...")
175
+
176
+ # Load pre-trained compressor and decompressor
177
+ self.compressor = Compressor().to(self.device)
178
+ self.decompressor = Decompressor().to(self.device)
179
+
180
+ self.compressor.load_state_dict(torch.load('final_compressor_model.pth', map_location=self.device))
181
+ self.decompressor.load_state_dict(torch.load('final_decompressor_model.pth', map_location=self.device))
182
+
183
+ # Initialize flow model with CFG
184
+ self.flow_model = AMPFlowMatcherCFGConcat(
185
+ hidden_dim=480,
186
+ compressed_dim=COMP_DIM,
187
+ n_layers=12,
188
+ n_heads=16,
189
+ dim_ff=3072,
190
+ max_seq_len=25, # MAX_SEQ_LEN // 2 due to pooling
191
+ use_cfg=True
192
+ ).to(self.device)
193
+
194
+ # Compile model for PyTorch 2.x speedup (if available)
195
+ try:
196
+ self.flow_model = torch.compile(self.flow_model, mode="reduce-overhead")
197
+ print("✓ Model compiled with torch.compile for speedup")
198
+ except Exception as e:
199
+ print(f"⚠️ Model compilation failed: {e}")
200
+
201
+ # Set models to training mode
202
+ self.compressor.train()
203
+ self.decompressor.train()
204
+ self.flow_model.train()
205
+
206
+ print(f"✓ Models initialized:")
207
+ print(f" Compressor parameters: {sum(p.numel() for p in self.compressor.parameters()):,}")
208
+ print(f" Decompressor parameters: {sum(p.numel() for p in self.decompressor.parameters()):,}")
209
+ print(f" Flow model parameters: {sum(p.numel() for p in self.flow_model.parameters()):,}")
210
+
211
+ def _initialize_data(self):
212
+ """Initialize datasets and dataloaders with FULL data."""
213
+ print("Initializing datasets with FULL data...")
214
+
215
+ # Create CFG dataset with FULL UniProt data
216
+ self.cfg_dataset = CFGFlowDataset(
217
+ embeddings_path=self.embeddings_path,
218
+ cfg_data_path=self.cfg_data_path,
219
+ use_masked_labels=True,
220
+ max_seq_len=MAX_SEQ_LEN,
221
+ device=self.device
222
+ )
223
+
224
+ # Create dataloader with optimized settings
225
+ self.dataloader = create_cfg_dataloader(
226
+ self.cfg_dataset,
227
+ batch_size=BATCH_SIZE,
228
+ shuffle=True,
229
+ num_workers=NUM_WORKERS
230
+ )
231
+
232
+ # Calculate total steps and validation intervals
233
+ self.total_steps = len(self.dataloader) * EPOCHS
234
+ self.validation_steps = VALIDATION_INTERVAL
235
+
236
+ print(f"✓ Dataset initialized with FULL data:")
237
+ print(f" Total samples: {len(self.cfg_dataset):,}")
238
+ print(f" Batch size: {BATCH_SIZE}")
239
+ print(f" Batches per epoch: {len(self.dataloader):,}")
240
+ print(f" Total training steps: {self.total_steps:,}")
241
+ print(f" Validation every: {self.validation_steps:,} steps")
242
+
243
+ def _initialize_optimizer(self):
244
+ """Initialize optimizer and learning rate scheduler."""
245
+ print("Initializing optimizer and scheduler...")
246
+
247
+ # Optimizer for flow model only (compressor/decompressor are frozen)
248
+ self.optimizer = optim.AdamW(
249
+ self.flow_model.parameters(),
250
+ lr=BASE_LR,
251
+ weight_decay=WEIGHT_DECAY,
252
+ betas=(0.9, 0.98), # Optimized betas for flow matching
253
+ eps=1e-6 # Lower epsilon for numerical stability
254
+ )
255
+
256
+ # Learning rate scheduler with proper warmup and cosine annealing
257
+ warmup_scheduler = LinearLR(
258
+ self.optimizer,
259
+ start_factor=0.1,
260
+ end_factor=1.0,
261
+ total_iters=WARMUP_STEPS
262
+ )
263
+
264
+ main_scheduler = CosineAnnealingLR(
265
+ self.optimizer,
266
+ T_max=self.total_steps - WARMUP_STEPS,
267
+ eta_min=LR_MIN
268
+ )
269
+
270
+ self.scheduler = SequentialLR(
271
+ self.optimizer,
272
+ schedulers=[warmup_scheduler, main_scheduler],
273
+ milestones=[WARMUP_STEPS]
274
+ )
275
+
276
+ print(f"✓ Optimizer initialized:")
277
+ print(f" Base LR: {BASE_LR}")
278
+ print(f" Min LR: {LR_MIN}")
279
+ print(f" Warmup steps: {WARMUP_STEPS}")
280
+ print(f" Weight decay: {WEIGHT_DECAY}")
281
+ print(f" Gradient clip norm: {GRADIENT_CLIP_NORM}")
282
+
283
+ def _preprocess_batch(self, batch):
284
+ """Preprocess a batch of data for training."""
285
+ # Extract data
286
+ embeddings = batch['embeddings'].to(self.device) # (B, L, ESM_DIM)
287
+ labels = batch['labels'].to(self.device) # (B,)
288
+
289
+ # Normalize embeddings
290
+ m, s = self.stats['mean'].to(self.device), self.stats['std'].to(self.device)
291
+ mn, mx = self.stats['min'].to(self.device), self.stats['max'].to(self.device)
292
+
293
+ embeddings = (embeddings - m) / (s + 1e-8)
294
+ embeddings = (embeddings - mn) / (mx - mn + 1e-8)
295
+
296
+ # Compress embeddings
297
+ with torch.no_grad():
298
+ compressed = self.compressor(embeddings) # (B, L, COMP_DIM)
299
+
300
+ return compressed, labels
301
+
302
+ def _compute_validation_metrics(self):
303
+ """Compute validation metrics on a subset of data."""
304
+ self.flow_model.eval()
305
+ val_losses = []
306
+
307
+ # Use a subset of data for validation
308
+ val_samples = min(1000, len(self.cfg_dataset))
309
+ val_indices = torch.randperm(len(self.cfg_dataset))[:val_samples]
310
+
311
+ with torch.no_grad():
312
+ for i in range(0, val_samples, BATCH_SIZE):
313
+ batch_indices = val_indices[i:i+BATCH_SIZE]
314
+ batch_data = [self.cfg_dataset[idx] for idx in batch_indices]
315
+
316
+ # Collate batch
317
+ embeddings = torch.stack([item['embedding'] for item in batch_data])
318
+ labels = torch.stack([item['label'] for item in batch_data])
319
+
320
+ # Preprocess
321
+ compressed, labels = self._preprocess_batch({
322
+ 'embeddings': embeddings,
323
+ 'labels': labels
324
+ })
325
+
326
+ B, L, D = compressed.shape
327
+
328
+ # Sample random time
329
+ t = torch.rand(B, device=self.device)
330
+
331
+ # Sample random noise
332
+ eps = torch.randn_like(compressed)
333
+
334
+ # Compute target
335
+ xt = (1 - t.unsqueeze(-1).unsqueeze(-1)) * compressed + t.unsqueeze(-1).unsqueeze(-1) * eps
336
+
337
+ # Predict vector field
338
+ vt_pred = self.flow_model(xt, t, labels=labels)
339
+
340
+ # Target vector field
341
+ vt_target = eps - compressed
342
+
343
+ # Compute loss
344
+ loss = F.mse_loss(vt_pred, vt_target)
345
+ val_losses.append(loss.item())
346
+
347
+ self.flow_model.train()
348
+ return np.mean(val_losses)
349
+
350
+ def train_flow_matching(self):
351
+ """Train the flow matching model with FULL data and optimizations."""
352
+ print(f"🚀 Starting Optimized Single GPU Flow Matching Training with FULL DATA")
353
+ print(f"GPU: {GPU_ID}")
354
+ print(f"Total iterations: {EPOCHS}")
355
+ print(f"Batch size: {BATCH_SIZE}")
356
+ print(f"Total samples: {len(self.cfg_dataset):,}")
357
+ print(f"Mixed precision: {USE_MIXED_PRECISION}")
358
+ print(f"Estimated time: ~8-10 hours (overnight training with ALL data)")
359
+ print("=" * 60)
360
+
361
+ # Training loop
362
+ best_loss = float('inf')
363
+ losses = []
364
+ val_losses = []
365
+ global_step = 0
366
+ start_time = time.time()
367
+
368
+ for epoch in tqdm(range(EPOCHS), desc="Training Flow Model"):
369
+ epoch_losses = []
370
+ epoch_start_time = time.time()
371
+
372
+ for batch_idx, batch in enumerate(self.dataloader):
373
+ # Preprocess batch
374
+ compressed, labels = self._preprocess_batch(batch)
375
+ B, L, D = compressed.shape
376
+
377
+ # CFG training: randomly mask some labels for unconditional training
378
+ if torch.rand(1).item() < CFG_DROPOUT_RATE:
379
+ labels = torch.full_like(labels, fill_value=-1) # Unconditional
380
+
381
+ # Sample random time
382
+ t = torch.rand(B, device=self.device) # (B,)
383
+
384
+ # Sample random noise
385
+ eps = torch.randn_like(compressed) # (B, L, D)
386
+
387
+ # Compute target: x_t = (1-t) * x_0 + t * eps
388
+ xt = (1 - t.unsqueeze(-1).unsqueeze(-1)) * compressed + t.unsqueeze(-1).unsqueeze(-1) * eps
389
+
390
+ # Forward pass with mixed precision
391
+ if USE_MIXED_PRECISION:
392
+ with autocast(dtype=torch.bfloat16):
393
+ vt_pred = self.flow_model(xt, t, labels=labels) # (B, L, D)
394
+ vt_target = eps - compressed # (B, L, D)
395
+ loss = F.mse_loss(vt_pred, vt_target)
396
+
397
+ # Backward pass with gradient scaling
398
+ self.optimizer.zero_grad()
399
+ self.scaler.scale(loss).backward()
400
+
401
+ # Gradient clipping
402
+ self.scaler.unscale_(self.optimizer)
403
+ torch.nn.utils.clip_grad_norm_(self.flow_model.parameters(), max_norm=GRADIENT_CLIP_NORM)
404
+
405
+ self.scaler.step(self.optimizer)
406
+ self.scaler.update()
407
+ else:
408
+ # Standard training
409
+ vt_pred = self.flow_model(xt, t, labels=labels) # (B, L, D)
410
+ vt_target = eps - compressed # (B, L, D)
411
+ loss = F.mse_loss(vt_pred, vt_target)
412
+
413
+ # Backward pass
414
+ self.optimizer.zero_grad()
415
+ loss.backward()
416
+
417
+ # Gradient clipping
418
+ torch.nn.utils.clip_grad_norm_(self.flow_model.parameters(), max_norm=GRADIENT_CLIP_NORM)
419
+
420
+ self.optimizer.step()
421
+
422
+ # Update learning rate
423
+ self.scheduler.step()
424
+
425
+ epoch_losses.append(loss.item())
426
+ global_step += 1
427
+
428
+ # Logging
429
+ if batch_idx % 100 == 0:
430
+ current_lr = self.scheduler.get_last_lr()[0]
431
+ elapsed_time = time.time() - start_time
432
+ steps_per_sec = global_step / elapsed_time
433
+ eta_hours = (self.total_steps - global_step) / steps_per_sec / 3600
434
+
435
+ print(f"Epoch {epoch:4d} | Step {global_step:6d}/{self.total_steps:6d} | "
436
+ f"Loss: {loss.item():.6f} | LR: {current_lr:.2e} | "
437
+ f"Speed: {steps_per_sec:.1f} steps/s | ETA: {eta_hours:.1f}h")
438
+
439
+ # Log to wandb
440
+ if self.use_wandb:
441
+ wandb.log({
442
+ 'train/loss': loss.item(),
443
+ 'train/learning_rate': current_lr,
444
+ 'train/steps_per_sec': steps_per_sec,
445
+ 'train/global_step': global_step
446
+ })
447
+
448
+ # Validation
449
+ if global_step % self.validation_steps == 0:
450
+ val_loss = self._compute_validation_metrics()
451
+ val_losses.append(val_loss)
452
+
453
+ print(f"Validation at step {global_step}: Loss = {val_loss:.6f}")
454
+
455
+ if self.use_wandb:
456
+ wandb.log({
457
+ 'val/loss': val_loss,
458
+ 'val/global_step': global_step
459
+ })
460
+
461
+ # Early stopping check
462
+ if val_loss < best_loss:
463
+ best_loss = val_loss
464
+ self._save_checkpoint(epoch, val_loss, global_step, is_final=False, is_best=True)
465
+
466
+ # Compute epoch statistics
467
+ avg_loss = np.mean(epoch_losses)
468
+ losses.append(avg_loss)
469
+ epoch_time = time.time() - epoch_start_time
470
+
471
+ print(f"Epoch {epoch:4d} | Avg Loss: {avg_loss:.6f} | "
472
+ f"LR: {self.scheduler.get_last_lr()[0]:.2e} | "
473
+ f"Time: {epoch_time:.1f}s | Samples: {len(self.cfg_dataset):,}")
474
+
475
+ # Save checkpoint
476
+ if (epoch + 1) % CHECKPOINT_INTERVAL == 0:
477
+ self._save_checkpoint(epoch, avg_loss, global_step, is_final=True)
478
+
479
+ # Save final model
480
+ self._save_checkpoint(EPOCHS - 1, losses[-1], global_step, is_final=True)
481
+
482
+ total_time = time.time() - start_time
483
+ print("=" * 60)
484
+ print("🎉 Optimized Training Complete with FULL DATA!")
485
+ print(f"Best validation loss: {best_loss:.6f}")
486
+ print(f"Total training time: {total_time/3600:.1f} hours")
487
+ print(f"Total samples used: {len(self.cfg_dataset):,}")
488
+ print(f"Final model saved as: amp_flow_model_final_optimized.pth")
489
+
490
+ return losses, val_losses
491
+
492
+ def _save_checkpoint(self, step, loss, global_step, is_final=False, is_best=False):
493
+ """Save model checkpoint."""
494
+ # Create output directory if it doesn't exist
495
+ output_dir = '/data2/edwardsun/flow_checkpoints'
496
+ os.makedirs(output_dir, exist_ok=True)
497
+
498
+ if is_best:
499
+ filename = os.path.join(output_dir, 'amp_flow_model_best_optimized.pth')
500
+ elif is_final:
501
+ filename = os.path.join(output_dir, 'amp_flow_model_final_optimized.pth')
502
+ else:
503
+ filename = os.path.join(output_dir, f'amp_flow_checkpoint_optimized_step_{step:04d}.pth')
504
+
505
+ checkpoint = {
506
+ 'step': step,
507
+ 'global_step': global_step,
508
+ 'loss': loss,
509
+ 'flow_model_state_dict': self.flow_model.state_dict(),
510
+ 'optimizer_state_dict': self.optimizer.state_dict(),
511
+ 'scheduler_state_dict': self.scheduler.state_dict(),
512
+ 'stats': self.stats,
513
+ 'total_samples': len(self.cfg_dataset),
514
+ 'config': {
515
+ 'batch_size': BATCH_SIZE,
516
+ 'epochs': EPOCHS,
517
+ 'base_lr': BASE_LR,
518
+ 'lr_min': LR_MIN,
519
+ 'warmup_steps': WARMUP_STEPS,
520
+ 'mixed_precision': USE_MIXED_PRECISION,
521
+ 'gradient_clip': GRADIENT_CLIP_NORM,
522
+ 'weight_decay': WEIGHT_DECAY
523
+ }
524
+ }
525
+
526
+ torch.save(checkpoint, filename)
527
+ print(f"✓ Checkpoint saved: {filename} (loss: {loss:.6f}, step: {global_step})")
528
+
529
+ def main():
530
+ """Main training function."""
531
+ global BATCH_SIZE, EPOCHS
532
+
533
+ parser = argparse.ArgumentParser(description='Optimized Single GPU AMP Flow Training with FULL DATA')
534
+ parser.add_argument('--embeddings', default='/data2/edwardsun/flow_project/peptide_embeddings/',
535
+ help='Path to peptide embeddings directory')
536
+ parser.add_argument('--cfg_data', default='/data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json',
537
+ help='Path to FULL CFG data file')
538
+ parser.add_argument('--use_wandb', action='store_true', help='Use wandb for logging')
539
+ parser.add_argument('--batch_size', type=int, default=BATCH_SIZE, help='Batch size for training')
540
+ parser.add_argument('--epochs', type=int, default=EPOCHS, help='Number of training epochs')
541
+
542
+ args = parser.parse_args()
543
+
544
+ # Update global variables if provided
545
+ if args.batch_size != BATCH_SIZE:
546
+ BATCH_SIZE = args.batch_size
547
+ if args.epochs != EPOCHS:
548
+ EPOCHS = args.epochs
549
+
550
+ print(f"Starting optimized training with batch_size={BATCH_SIZE}, epochs={EPOCHS}")
551
+
552
+ # Initialize trainer
553
+ trainer = AMPFlowTrainerSingleGPUFullData(args.embeddings, args.cfg_data, args.use_wandb)
554
+
555
+ # Start training
556
+ losses, val_losses = trainer.train_flow_matching()
557
+
558
+ print("Optimized training completed successfully with FULL DATA!")
559
+
560
+ if __name__ == "__main__":
561
+ main()
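For reference, the normalization pipeline behind normalization_stats.pt (z-score normalization, clamping of outliers to ±4, then per-dimension min-max scaling into [0, 1]) can be summarized as below. This is a distilled sketch of the logic in the multi-GPU script's _compute_preprocessing_stats/_preprocess_batch (this single-GPU script applies a slightly simplified variant), not a drop-in replacement.

```python
import torch

def fit_normalization_stats(embeddings, esm_dim=1280):
    """embeddings: (N, L, esm_dim) stacked ESM-2 embeddings."""
    flat = embeddings.view(-1, esm_dim)
    mean = flat.mean(0)
    std = flat.std(0) + 1e-8
    clamped = torch.clamp((flat - mean) / std, -4.0, 4.0)   # saturate outliers
    return {"mean": mean, "std": std,
            "min": clamped.min(0)[0], "max": clamped.max(0)[0]}

def apply_normalization(x, stats):
    """x: (B, L, esm_dim) raw embeddings -> values scaled into [0, 1]."""
    h = torch.clamp((x - stats["mean"]) / stats["std"], -4.0, 4.0)
    h = (h - stats["min"]) / (stats["max"] - stats["min"] + 1e-8)
    return torch.clamp(h, 0.0, 1.0)
```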
apex/AMP_DL_model_twohead.py ADDED
@@ -0,0 +1,113 @@
1
+ import numpy as np
2
+ import torch
3
+ import torch.nn as nn
4
+ import torch.nn.functional as F
5
+ import math, copy, time
6
+ from torch.autograd import Variable
7
+
8
+ class PeptideEmbeddings(nn.Module):
9
+ def __init__(self, emb):
10
+ super().__init__()
11
+ self.aa_embedding = nn.Embedding.from_pretrained(torch.FloatTensor(emb), padding_idx=0)
12
+ def forward(self, x):
13
+ out = self.aa_embedding(x)
14
+ return out
15
+
16
+ class AMP_model(nn.Module):
17
+ def __init__(self, emb, emb_size, num_rnn_layers, dim_h, dim_latent, num_fc_layers, num_task):
18
+ super().__init__()
19
+
20
+ self.peptideEmb = PeptideEmbeddings(emb=emb)
21
+ self.dim_emb = emb_size
22
+ self.dim_h = dim_h
23
+ self.dropout = 0.1
24
+ self.dim_latent = dim_latent
25
+ max_len = 52
26
+
27
+ self.rnn = nn.GRU(emb_size, dim_h, num_layers=num_rnn_layers, batch_first=True, dropout=0.1, bidirectional=True)
28
+ self.layernorm = nn.LayerNorm(dim_h * 2)
29
+ self.attn1 = nn.Linear(dim_h * 2 + emb_size, max_len)
30
+ self.attn2 = nn.Linear(dim_h * 2, 1)
31
+
32
+ self.fc0 = nn.Linear(dim_h * 2, dim_h)
33
+
34
+ self.fc1 = nn.Linear(dim_h, dim_latent)
35
+ self.fc2 = nn.Linear(dim_latent, int(dim_latent / 2))
36
+ self.fc3 = nn.Linear(int(dim_latent / 2), int(dim_latent / 4))
37
+ self.fc4 = nn.Linear(int(dim_latent / 4), num_task)
38
+
39
+ self.ln1 = nn.LayerNorm(dim_latent)
40
+ self.ln2 = nn.LayerNorm(int(dim_latent / 2))
41
+ self.ln3 = nn.LayerNorm(int(dim_latent / 4))
42
+
43
+ self.dp1 = nn.Dropout(0.1)#nn.Dropout(0.2)
44
+ self.dp2 = nn.Dropout(0.1)#nn.Dropout(0.2)
45
+ self.dp3 = nn.Dropout(0.1)#nn.Dropout(0.2)
46
+
47
+
48
+
49
+ self.fc1_ = nn.Linear(dim_h, dim_latent)
50
+ self.fc2_ = nn.Linear(dim_latent, int(dim_latent / 2))
51
+ self.fc3_ = nn.Linear(int(dim_latent / 2), int(dim_latent / 4))
52
+ self.fc4_ = nn.Linear(int(dim_latent / 4), 1)
53
+
54
+ self.ln1_ = nn.LayerNorm(dim_latent)
55
+ self.ln2_ = nn.LayerNorm(int(dim_latent / 2))
56
+ self.ln3_ = nn.LayerNorm(int(dim_latent / 4))
57
+
58
+ self.dp1_ = nn.Dropout(0.1)#nn.Dropout(0.2)
59
+ self.dp2_ = nn.Dropout(0.1)#nn.Dropout(0.2)
60
+ self.dp3_ = nn.Dropout(0.1)#nn.Dropout(0.2)
61
+
62
+
63
+
64
+
65
+ def forward(self, x):
66
+
67
+ x = self.peptideEmb(x)
68
+ #h = self.initH(x.shape[0])
69
+ #out, h = self.rnn(x, h)
70
+ out, h = self.rnn(x)
71
+ out = self.layernorm(out)
72
+
73
+ attn_weights1 = F.softmax(self.attn1(torch.cat((out, x), 2)), dim=2) #to be tested: masked softmax
74
+ attn_weights1.permute(0, 2, 1)
75
+ out = torch.bmm(attn_weights1, out)
76
+ attn_weights2 = F.softmax(self.attn2(out), dim=1) #to be tested: masked softmax
77
+ out = torch.sum(attn_weights2 * out, dim=1) #to be test: masked sum
78
+
79
+ out = self.fc0(out)
80
+
81
+ out = self.dp1(F.relu(self.ln1(self.fc1(out))))
82
+ out = self.dp2(F.relu(self.ln2(self.fc2(out))))
83
+ out = self.dp3(F.relu(self.ln3(self.fc3(out))))
84
+ out = self.fc4(out)
85
+
86
+ return F.relu(out)
87
+
88
+ def predict(self, x):
89
+ return self.forward(x)
90
+
91
+
92
+ def cls_forward(self, x):
93
+
94
+ x = self.peptideEmb(x)
95
+ #h = self.initH(x.shape[0])
96
+ #out, h = self.rnn(x, h)
97
+ out, h = self.rnn(x)
98
+ out = self.layernorm(out)
99
+
100
+ attn_weights1 = F.softmax(self.attn1(torch.cat((out, x), 2)), dim=2) #to be tested: masked softmax
101
+ attn_weights1.permute(0, 2, 1)
102
+ out = torch.bmm(attn_weights1, out)
103
+ attn_weights2 = F.softmax(self.attn2(out), dim=1) #to be tested: masked softmax
104
+ out = torch.sum(attn_weights2 * out, dim=1) #to be test: masked sum
105
+
106
+ out = self.fc0(out)
107
+
108
+ out = self.dp1_(F.relu(self.ln1_(self.fc1_(out))))
109
+ out = self.dp2_(F.relu(self.ln2_(self.fc2_(out))))
110
+ out = self.dp3_(F.relu(self.ln3_(self.fc3_(out))))
111
+ out = self.fc4_(out)
112
+
113
+ return out
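For orientation, here is a hypothetical instantiation of the two-head model above. The embedding matrix and every hyperparameter value are placeholders chosen only so that the shapes line up (inputs must be padded or truncated to the hard-coded max_len of 52); the real values are set by apex/predict.py.

```python
# Assumes this is run from the apex/ directory.
import numpy as np
import torch

from AMP_DL_model_twohead import AMP_model

vocab_size, emb_size = 21, 100                  # placeholder: 20 amino acids + padding index 0
emb = np.random.randn(vocab_size, emb_size).astype(np.float32)

model = AMP_model(emb=emb, emb_size=emb_size, num_rnn_layers=2,
                  dim_h=256, dim_latent=128, num_fc_layers=3, num_task=34)

tokens = torch.randint(1, vocab_size, (4, 52))  # integer-encoded peptides padded to length 52
mic_out = model(tokens)                         # regression head: (4, num_task) predicted MICs
cls_out = model.cls_forward(tokens)             # classification head: (4, 1) AMP logit
print(mic_out.shape, cls_out.shape)
```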
apex/Predicted_MICs.csv ADDED
@@ -0,0 +1,11 @@
1
+ ,E. coli ATCC11775,P. aeruginosa PAO1,P. aeruginosa PA14,S. aureus ATCC12600,E. coli AIG221,E. coli AIG222,K. pneumoniae ATCC13883,A. baumannii ATCC19606,A. muciniphila ATCC BAA-835,B. fragilis ATCC25285,B. vulgatus ATCC8482,C. aerofaciens ATCC25986,C. scindens ATCC35704,B. thetaiotaomicron ATCC29148,B. thetaiotaomicron Complemmented,B. thetaiotaomicron Mutant,B. uniformis ATCC8492,B. eggerthi ATCC27754,C. spiroforme ATCC29900,P. distasonis ATCC8503,P. copri DSMZ18205,B. ovatus ATCC8483,E. rectale ATCC33656,C. symbiosum,R. obeum,R. torques,S. aureus (ATCC BAA-1556) - MRSA,vancomycin-resistant E. faecalis ATCC700802,vancomycin-resistant E. faecium ATCC700221,E. coli Nissle,Salmonella enterica ATCC 9150 (BEIRES NR-515),Salmonella enterica (BEIRES NR-170),Salmonella enterica ATCC 9150 (BEIRES NR-174),L. monocytogenes ATCC 19111 (BEIRES NR-106)
2
+ IPKTYDKRWDDQCWLAITGRYHGITTPPCCSWVV,134.37933,133.67014,133.47633,132.06357,139.48221,136.32346,137.199,126.84656,126.05934,137.75461,138.1181,142.69162,137.7662,139.89436,133.77281,134.47473,144.46127,133.90617,136.54572,138.02174,131.41934,133.84996,131.85303,302.2381,412.25684,296.79123,137.54033,135.57431,135.85416,2526.2712,100.91771,1219.0903,676.08905,124.86306
3
+ KWLIYYNEGHLMVKYMLTISVRIPEGDNPNIQLHGSIGSR,113.27322,113.816246,105.8549,121.385605,118.87728,117.084915,121.39231,97.408005,114.83742,126.84959,121.16374,117.40218,118.76237,127.41105,124.46525,122.88039,120.520775,116.183304,128.06148,109.68715,118.8102,127.2724,118.91581,295.51852,416.33197,292.758,128.81776,132.28825,108.17039,2471.853,80.87459,1018.7287,640.92346,103.58418
4
+ VGHAQVASPDLHWDGHGNHLIPWTPCYSHEMNPTMPPA,139.44724,136.14822,134.20648,136.02388,141.02017,141.60233,139.74214,136.73692,135.3081,139.60617,140.51639,138.64197,137.41493,141.17766,137.0292,136.66333,143.8462,136.84438,136.74908,135.07846,133.97592,136.8179,137.57657,308.34232,440.25397,311.01697,140.81741,141.47835,136.60622,2779.5024,107.65519,1268.6604,722.2069,134.72772
5
+ RIWETQGSDCIRDGIDSTGPPFMVMFHAAGWRQVHSK,127.36061,130.93207,129.1912,133.74936,130.3024,128.63132,132.07448,114.99402,118.79469,135.40488,129.4568,131.13911,129.87733,135.09549,132.68257,132.721,141.9473,132.37192,133.2458,127.67586,126.89638,133.11191,132.14206,276.87024,390.039,291.6936,138.57545,139.01753,126.76943,2116.5493,90.535255,935.9032,628.7894,112.784706
6
+ IYEDYEFVRMPTHMTDFMQSPDQQNPKHMWTLCFDHT,138.19647,136.43967,134.19327,136.39677,140.33774,140.78696,138.82767,134.45322,135.36095,137.68173,142.58617,141.29187,142.04684,139.79361,135.85521,134.95773,140.02855,135.06549,135.18292,135.65457,126.83917,135.78409,134.22078,295.92462,423.1288,300.80344,139.7175,138.30704,130.73404,2520.869,103.63745,1122.4617,687.5478,133.08084
7
+ CPWVQHFWAPPWAHCICIEGPEESGWATIEPMVVGT,137.11711,136.29323,134.96661,138.4914,142.4903,139.87944,138.16125,131.98262,139.51706,141.34677,142.34552,143.17244,142.15126,143.51427,137.9091,138.8798,148.41785,137.80379,137.64511,136.5797,134.00578,137.73303,140.58127,308.2508,410.4842,304.77887,142.90944,142.9198,144.48164,2608.0327,113.228615,1234.4288,722.4048,132.68794
8
+ FPLTMHGEFSQNLVWTITQHLVKRWCYTLSPKFCHRY,132.82092,130.18073,128.30576,137.0156,136.58134,136.30899,136.51419,119.462135,117.975525,137.00491,138.4794,137.34769,143.57109,138.3506,133.10274,132.72733,138.9192,130.98361,137.97784,130.82622,128.28854,134.46898,138.15057,305.70197,424.16177,313.93192,140.46169,136.51822,135.93553,2973.7498,98.21229,1215.7686,712.4238,124.42888
9
+ SRSEDQILATYWRTSTCYFNQLWFQRLTGQQRICC,132.35309,135.08835,133.4797,133.83403,133.55246,133.56703,133.06903,127.04684,118.50051,137.62326,134.35898,139.048,139.2864,136.87164,133.12827,133.68544,138.84805,135.44177,134.43094,132.8223,130.07999,134.2626,134.12837,265.18292,371.36026,286.07922,137.51228,138.9885,130.92508,2142.12,94.96339,856.95593,609.7506,119.94425
10
+ QLELPCCIETWKLNVAFRCPFHKDLKRLGLYSRDKW,96.86034,112.103455,106.62036,130.5871,97.71516,91.49239,112.85535,70.78958,77.14095,129.05354,115.55772,111.96426,105.05908,126.89999,123.53723,117.43738,116.70744,106.60911,124.693054,112.9424,114.001854,123.950424,114.90995,303.3677,364.20624,275.29742,134.68173,130.36494,114.740585,2476.3755,62.907795,913.19275,507.56738,83.11806
11
+ PPMDCVYAIKTTSDHQSTMFIIPRYTHMYGNLQLWCVYCT,135.86214,137.20145,135.70978,134.45326,140.82416,139.36844,138.8737,130.17923,136.19333,138.40355,136.87325,143.45218,132.80997,140.24698,135.89694,135.31473,144.77373,138.8391,138.01976,137.25049,130.33195,135.40433,133.84612,288.31738,409.42154,300.98975,139.98962,136.1305,135.23387,2705.215,105.53652,1263.1531,691.8714,129.66193
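Predicted_MICs.csv stores one row per peptide (first, unnamed column) and one predicted MIC column per bacterial strain, as in the header above. A minimal sketch of how one might load and summarize it, assuming pandas is installed and the file sits at apex/Predicted_MICs.csv:

```python
# Sketch: load the APEX MIC predictions and summarize them per peptide.
# Assumes pandas is available and the path below matches this repository layout.
import pandas as pd

# The first (unnamed) column holds the peptide sequence; the rest are per-strain MICs.
mics = pd.read_csv("apex/Predicted_MICs.csv", index_col=0)

# For each peptide, report the lowest predicted MIC and the strain it corresponds to.
summary = pd.DataFrame({
    "best_mic": mics.min(axis=1),
    "best_strain": mics.idxmin(axis=1),
})
print(summary.sort_values("best_mic").head())
```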
apex/README.md ADDED
@@ -0,0 +1,24 @@
1
+ # APEX - Molecular de-extinction of antibiotics enabled by deep learning
2
+
3
+ ## Predict AMPs using APEX
4
+ Running predict.py generates species-specific antimicrobial activities (MICs) for the peptides in test_seqs.txt and saves them to Predicted_MICs.csv. To predict antimicrobial activities for novel peptides, replace the peptides in test_seqs.txt with the peptides of interest, or change line 84 of predict.py to the path of your own peptide file. Make sure that in this file each line corresponds to a single peptide sequence (<= 50 amino acids in length); a usage sketch follows this README.
5
+
6
+
7
+ ## Software version
8
+ pytorch: 1.11.0+cu113 (this code runs only on a CUDA-capable device)
9
+
10
+ ## Configuration
11
+ conda create -n apex python==3.9
12
+
13
+ conda activate apex
14
+
15
+ pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
16
+
17
+ pip install -r requirement.txt
18
+
19
+ ## Running
20
+ python predict.py test_seqs.txt
21
+
22
+ ## Contacts
23
+ If you have any questions or comments, please feel free to email Fangping Wan (fangping[dot]wan[at]pennmedicine[dot]upenn[dot]edu) and/or César de la Fuente (cfuente[at]pennmedicine[dot]upenn[dot]edu).
24
+
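To illustrate the input format the README describes (one peptide per line, at most 50 amino acids), here is a hedged sketch that writes a candidate list to test_seqs.txt after a basic length and alphabet check. The sequences below are placeholders, not peptides from this repository; the output path assumes the apex/ directory layout shown here.

```python
# Sketch: prepare an input file for apex/predict.py as described in the README above.
# The candidate sequences are illustrative placeholders; replace them with your own.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

candidates = [
    "GIGKFLHSAKKFGKAFVGEIMNS",                # placeholder example sequence
    "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK",  # placeholder example sequence
]

with open("apex/test_seqs.txt", "w") as handle:
    for seq in candidates:
        seq = seq.strip().upper()
        if len(seq) <= 50 and set(seq) <= VALID_RESIDUES:
            handle.write(seq + "\n")
        else:
            print(f"Skipping invalid or overlong sequence: {seq}")

# Then, following the README's Running section, from the apex/ directory:
#   python predict.py test_seqs.txt
# Predictions are written to Predicted_MICs.csv.
```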
apex/aaindex1.csv ADDED
@@ -0,0 +1,567 @@
1
+ Description,A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
2
+ ANDN920101,4.35,4.38,4.75,4.76,4.65,4.37,4.29,3.97,4.63,3.95,4.17,4.36,4.52,4.66,4.44,4.50,4.35,4.70,4.60,3.95
3
+ ARGP820101,0.61,0.60,0.06,0.46,1.07,0.,0.47,0.07,0.61,2.22,1.53,1.15,1.18,2.02,1.95,0.05,0.05,2.65,1.88,1.32
4
+ ARGP820102,1.18,0.20,0.23,0.05,1.89,0.72,0.11,0.49,0.31,1.45,3.23,0.06,2.67,1.96,0.76,0.97,0.84,0.77,0.39,1.08
5
+ ARGP820103,1.56,0.45,0.27,0.14,1.23,0.51,0.23,0.62,0.29,1.67,2.93,0.15,2.96,2.03,0.76,0.81,0.91,1.08,0.68,1.14
6
+ BEGF750101,1.,0.52,0.35,0.44,0.06,0.44,0.73,0.35,0.60,0.73,1.,0.60,1.,0.60,0.06,0.35,0.44,0.73,0.44,0.82
7
+ BEGF750102,0.77,0.72,0.55,0.65,0.65,0.72,0.55,0.65,0.83,0.98,0.83,0.55,0.98,0.98,0.55,0.55,0.83,0.77,0.83,0.98
8
+ BEGF750103,0.37,0.84,0.97,0.97,0.84,0.64,0.53,0.97,0.75,0.37,0.53,0.75,0.64,0.53,0.97,0.84,0.75,0.97,0.84,0.37
9
+ BHAR880101,0.357,0.529,0.463,0.511,0.346,0.493,0.497,0.544,0.323,0.462,0.365,0.466,0.295,0.314,0.509,0.507,0.444,0.305,0.420,0.386
10
+ BIGC670101,52.6,109.1,75.7,68.4,68.3,89.7,84.7,36.3,91.9,102.0,102.0,105.1,97.7,113.9,73.6,54.9,71.2,135.4,116.2,85.1
11
+ BIOV880101,16.,-70.,-74.,-78.,168.,-73.,-106.,-13.,50.,151.,145.,-141.,124.,189.,-20.,-70.,-38.,145.,53.,123.
12
+ BIOV880102,44.,-68.,-72.,-91.,90.,-117.,-139.,-8.,47.,100.,108.,-188.,121.,148.,-36.,-60.,-54.,163.,22.,117.
13
+ BROC820101,7.3,-3.6,-5.7,-2.9,-9.2,-0.3,-7.1,-1.2,-2.1,6.6,20.0,-3.7,5.6,19.2,5.1,-4.1,0.8,16.3,5.9,3.5
14
+ BROC820102,3.9,3.2,-2.8,-2.8,-14.3,1.8,-7.5,-2.3,2.0,11.0,15.0,-2.5,4.1,14.7,5.6,-3.5,1.1,17.8,3.8,2.1
15
+ BULH740101,-0.20,-0.12,0.08,-0.20,-0.45,0.16,-0.30,0.00,-0.12,-2.26,-2.46,-0.35,-1.47,-2.33,-0.98,-0.39,-0.52,-2.01,-2.24,-1.56
16
+ BULH740102,0.691,0.728,0.596,0.558,0.624,0.649,0.632,0.592,0.646,0.809,0.842,0.767,0.709,0.756,0.730,0.594,0.655,0.743,0.743,0.777
17
+ BUNA790101,8.249,8.274,8.747,8.410,8.312,8.411,8.368,8.391,8.415,8.195,8.423,8.408,8.418,8.228,0.,8.380,8.236,8.094,8.183,8.436
18
+ BUNA790102,4.349,4.396,4.755,4.765,4.686,4.373,4.295,3.972,4.630,4.224,4.385,4.358,4.513,4.663,4.471,4.498,4.346,4.702,4.604,4.184
19
+ BUNA790103,6.5,6.9,7.5,7.0,7.7,6.0,7.0,5.6,8.0,7.0,6.5,6.5,0.,9.4,0.,6.5,6.9,0.,6.8,7.0
20
+ BURA740101,0.486,0.262,0.193,0.288,0.200,0.418,0.538,0.120,0.400,0.370,0.420,0.402,0.417,0.318,0.208,0.200,0.272,0.462,0.161,0.379
21
+ BURA740102,0.288,0.362,0.229,0.271,0.533,0.327,0.262,0.312,0.200,0.411,0.400,0.265,0.375,0.318,0.340,0.354,0.388,0.231,0.429,0.495
22
+ CHAM810101,0.52,0.68,0.76,0.76,0.62,0.68,0.68,0.00,0.70,1.02,0.98,0.68,0.78,0.70,0.36,0.53,0.50,0.70,0.70,0.76
23
+ CHAM820101,0.046,0.291,0.134,0.105,0.128,0.180,0.151,0.000,0.230,0.186,0.186,0.219,0.221,0.290,0.131,0.062,0.108,0.409,0.298,0.140
24
+ CHAM820102,-0.368,-1.03,0.,2.06,4.53,0.731,1.77,-0.525,0.,0.791,1.07,0.,0.656,1.06,-2.24,-0.524,0.,1.60,4.91,0.401
25
+ CHAM830101,0.71,1.06,1.37,1.21,1.19,0.87,0.84,1.52,1.07,0.66,0.69,0.99,0.59,0.71,1.61,1.34,1.08,0.76,1.07,0.63
26
+ CHAM830102,-0.118,0.124,0.289,0.048,0.083,-0.105,-0.245,0.104,0.138,0.230,-0.052,0.032,-0.258,0.015,0.,0.225,0.166,0.158,0.094,0.513
27
+ CHAM830103,0.,1.,1.,1.,1.,1.,1.,0.,1.,2.,1.,1.,1.,1.,0.,1.,2.,1.,1.,2.
28
+ CHAM830104,0.,1.,1.,1.,0.,1.,1.,0.,1.,1.,2.,1.,1.,1.,0.,0.,0.,1.,1.,0.
29
+ CHAM830105,0.,1.,0.,0.,0.,1.,1.,0.,1.,0.,0.,1.,1.,1.,0.,0.,0.,1.5,1.,0.
30
+ CHAM830106,0.,5.,2.,2.,1.,3.,3.,0.,3.,2.,2.,4.,3.,4.,0.,1.,1.,5.,5.,1.
31
+ CHAM830107,0.,0.,1.,1.,0.,0.,1.,1.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.
32
+ CHAM830108,0.,1.,1.,0.,1.,1.,0.,0.,1.,0.,0.,1.,1.,1.,0.,0.,0.,1.,1.,0.
33
+ CHOC750101,91.5,202.0,135.2,124.5,117.7,161.1,155.1,66.4,167.3,168.8,167.9,171.3,170.8,203.4,129.3,99.1,122.1,237.6,203.6,141.7
34
+ CHOC760101,115.,225.,160.,150.,135.,180.,190.,75.,195.,175.,170.,200.,185.,210.,145.,115.,140.,255.,230.,155.
35
+ CHOC760102,25.,90.,63.,50.,19.,71.,49.,23.,43.,18.,23.,97.,31.,24.,50.,44.,47.,32.,60.,18.
36
+ CHOC760103,0.38,0.01,0.12,0.15,0.45,0.07,0.18,0.36,0.17,0.60,0.45,0.03,0.40,0.50,0.18,0.22,0.23,0.27,0.15,0.54
37
+ CHOC760104,0.20,0.00,0.03,0.04,0.22,0.01,0.03,0.18,0.02,0.19,0.16,0.00,0.11,0.14,0.04,0.08,0.08,0.04,0.03,0.18
38
+ CHOP780101,0.66,0.95,1.56,1.46,1.19,0.98,0.74,1.56,0.95,0.47,0.59,1.01,0.60,0.60,1.52,1.43,0.96,0.96,1.14,0.50
39
+ CHOP780201,1.42,0.98,0.67,1.01,0.70,1.11,1.51,0.57,1.00,1.08,1.21,1.16,1.45,1.13,0.57,0.77,0.83,1.08,0.69,1.06
40
+ CHOP780202,0.83,0.93,0.89,0.54,1.19,1.10,0.37,0.75,0.87,1.60,1.30,0.74,1.05,1.38,0.55,0.75,1.19,1.37,1.47,1.70
41
+ CHOP780203,0.74,1.01,1.46,1.52,0.96,0.96,0.95,1.56,0.95,0.47,0.50,1.19,0.60,0.66,1.56,1.43,0.98,0.60,1.14,0.59
42
+ CHOP780204,1.29,0.44,0.81,2.02,0.66,1.22,2.44,0.76,0.73,0.67,0.58,0.66,0.71,0.61,2.01,0.74,1.08,1.47,0.68,0.61
43
+ CHOP780205,1.20,1.25,0.59,0.61,1.11,1.22,1.24,0.42,1.77,0.98,1.13,1.83,1.57,1.10,0.00,0.96,0.75,0.40,0.73,1.25
44
+ CHOP780206,0.70,0.34,1.42,0.98,0.65,0.75,1.04,1.41,1.22,0.78,0.85,1.01,0.83,0.93,1.10,1.55,1.09,0.62,0.99,0.75
45
+ CHOP780207,0.52,1.24,1.64,1.06,0.94,0.70,0.59,1.64,1.86,0.87,0.84,1.49,0.52,1.04,1.58,0.93,0.86,0.16,0.96,0.32
46
+ CHOP780208,0.86,0.90,0.66,0.38,0.87,1.65,0.35,0.63,0.54,1.94,1.30,1.00,1.43,1.50,0.66,0.63,1.17,1.49,1.07,1.69
47
+ CHOP780209,0.75,0.90,1.21,0.85,1.11,0.65,0.55,0.74,0.90,1.35,1.27,0.74,0.95,1.50,0.40,0.79,0.75,1.19,1.96,1.79
48
+ CHOP780210,0.67,0.89,1.86,1.39,1.34,1.09,0.92,1.46,0.78,0.59,0.46,1.09,0.52,0.30,1.58,1.41,1.09,0.48,1.23,0.42
49
+ CHOP780211,0.74,1.05,1.13,1.32,0.53,0.77,0.85,1.68,0.96,0.53,0.59,0.82,0.85,0.44,1.69,1.49,1.16,1.59,1.01,0.59
50
+ CHOP780212,0.060,0.070,0.161,0.147,0.149,0.074,0.056,0.102,0.140,0.043,0.061,0.055,0.068,0.059,0.102,0.120,0.086,0.077,0.082,0.062
51
+ CHOP780213,0.076,0.106,0.083,0.110,0.053,0.098,0.060,0.085,0.047,0.034,0.025,0.115,0.082,0.041,0.301,0.139,0.108,0.013,0.065,0.048
52
+ CHOP780214,0.035,0.099,0.191,0.179,0.117,0.037,0.077,0.190,0.093,0.013,0.036,0.072,0.014,0.065,0.034,0.125,0.065,0.064,0.114,0.028
53
+ CHOP780215,0.058,0.085,0.091,0.081,0.128,0.098,0.064,0.152,0.054,0.056,0.070,0.095,0.055,0.065,0.068,0.106,0.079,0.167,0.125,0.053
54
+ CHOP780216,0.64,1.05,1.56,1.61,0.92,0.84,0.80,1.63,0.77,0.29,0.36,1.13,0.51,0.62,2.04,1.52,0.98,0.48,1.08,0.43
55
+ CIDH920101,-0.45,-0.24,-0.20,-1.52,0.79,-0.99,-0.80,-1.00,1.07,0.76,1.29,-0.36,1.37,1.48,-0.12,-0.98,-0.70,1.38,1.49,1.26
56
+ CIDH920102,-0.08,-0.09,-0.70,-0.71,0.76,-0.40,-1.31,-0.84,0.43,1.39,1.24,-0.09,1.27,1.53,-0.01,-0.93,-0.59,2.25,1.53,1.09
57
+ CIDH920103,0.36,-0.52,-0.90,-1.09,0.70,-1.05,-0.83,-0.82,0.16,2.17,1.18,-0.56,1.21,1.01,-0.06,-0.60,-1.20,1.31,1.05,1.21
58
+ CIDH920104,0.17,-0.70,-0.90,-1.05,1.24,-1.20,-1.19,-0.57,-0.25,2.06,0.96,-0.62,0.60,1.29,-0.21,-0.83,-0.62,1.51,0.66,1.21
59
+ CIDH920105,0.02,-0.42,-0.77,-1.04,0.77,-1.10,-1.14,-0.80,0.26,1.81,1.14,-0.41,1.00,1.35,-0.09,-0.97,-0.77,1.71,1.11,1.13
60
+ COHE430101,0.75,0.70,0.61,0.60,0.61,0.67,0.66,0.64,0.67,0.90,0.90,0.82,0.75,0.77,0.76,0.68,0.70,0.74,0.71,0.86
61
+ CRAJ730101,1.33,0.79,0.72,0.97,0.93,1.42,1.66,0.58,1.49,0.99,1.29,1.03,1.40,1.15,0.49,0.83,0.94,1.33,0.49,0.96
62
+ CRAJ730102,1.00,0.74,0.75,0.89,0.99,0.87,0.37,0.56,0.36,1.75,1.53,1.18,1.40,1.26,0.36,0.65,1.15,0.84,1.41,1.61
63
+ CRAJ730103,0.60,0.79,1.42,1.24,1.29,0.92,0.64,1.38,0.95,0.67,0.70,1.10,0.67,1.05,1.47,1.26,1.05,1.23,1.35,0.48
64
+ DAWD720101,2.5,7.5,5.0,2.5,3.0,6.0,5.0,0.5,6.0,5.5,5.5,7.0,6.0,6.5,5.5,3.0,5.0,7.0,7.0,5.0
65
+ DAYM780101,8.6,4.9,4.3,5.5,2.9,3.9,6.0,8.4,2.0,4.5,7.4,6.6,1.7,3.6,5.2,7.0,6.1,1.3,3.4,6.6
66
+ DAYM780201,100.,65.,134.,106.,20.,93.,102.,49.,66.,96.,40.,56.,94.,41.,56.,120.,97.,18.,41.,74.
67
+ DESM900101,1.56,0.59,0.51,0.23,1.80,0.39,0.19,1.03,1.,1.27,1.38,0.15,1.93,1.42,0.27,0.96,1.11,0.91,1.10,1.58
68
+ DESM900102,1.26,0.38,0.59,0.27,1.60,0.39,0.23,1.08,1.,1.44,1.36,0.33,1.52,1.46,0.54,0.98,1.01,1.06,0.89,1.33
69
+ EISD840101,0.25,-1.76,-0.64,-0.72,0.04,-0.69,-0.62,0.16,-0.40,0.73,0.53,-1.10,0.26,0.61,-0.07,-0.26,-0.18,0.37,0.02,0.54
70
+ EISD860101,0.67,-2.1,-0.6,-1.2,0.38,-0.22,-0.76,0.,0.64,1.9,1.9,-0.57,2.4,2.3,1.2,0.01,0.52,2.6,1.6,1.5
71
+ EISD860102,0.,10.,1.3,1.9,0.17,1.9,3.,0.,0.99,1.2,1.0,5.7,1.9,1.1,0.18,0.73,1.5,1.6,1.8,0.48
72
+ EISD860103,0.,-0.96,-0.86,-0.98,0.76,-1.0,-0.89,0.,-0.75,0.99,0.89,-0.99,0.94,0.92,0.22,-0.67,0.09,0.67,-0.93,0.84
73
+ FASG760101,89.09,174.20,132.12,133.10,121.15,146.15,147.13,75.07,155.16,131.17,131.17,146.19,149.21,165.19,115.13,105.09,119.12,204.24,181.19,117.15
74
+ FASG760102,297.,238.,236.,270.,178.,185.,249.,290.,277.,284.,337.,224.,283.,284.,222.,228.,253.,282.,344.,293.
75
+ FASG760103,1.80,12.50,-5.60,5.05,-16.50,6.30,12.00,0.00,-38.50,12.40,-11.00,14.60,-10.00,-34.50,-86.20,-7.50,-28.00,-33.70,-10.00,5.63
76
+ FASG760104,9.69,8.99,8.80,9.60,8.35,9.13,9.67,9.78,9.17,9.68,9.60,9.18,9.21,9.18,10.64,9.21,9.10,9.44,9.11,9.62
77
+ FASG760105,2.34,1.82,2.02,1.88,1.92,2.17,2.10,2.35,1.82,2.36,2.36,2.16,2.28,2.16,1.95,2.19,2.09,2.43,2.20,2.32
78
+ FAUJ830101,0.31,-1.01,-0.60,-0.77,1.54,-0.22,-0.64,0.00,0.13,1.80,1.70,-0.99,1.23,1.79,0.72,-0.04,0.26,2.25,0.96,1.22
79
+ FAUJ880101,1.28,2.34,1.60,1.60,1.77,1.56,1.56,0.00,2.99,4.19,2.59,1.89,2.35,2.94,2.67,1.31,3.03,3.21,2.94,3.67
80
+ FAUJ880102,0.53,0.69,0.58,0.59,0.66,0.71,0.72,0.00,0.64,0.96,0.92,0.78,0.77,0.71,0.,0.55,0.63,0.84,0.71,0.89
81
+ FAUJ880103,1.00,6.13,2.95,2.78,2.43,3.95,3.78,0.00,4.66,4.00,4.00,4.77,4.43,5.89,2.72,1.60,2.60,8.08,6.47,3.00
82
+ FAUJ880104,2.87,7.82,4.58,4.74,4.47,6.11,5.97,2.06,5.23,4.92,4.92,6.89,6.36,4.62,4.11,3.97,4.11,7.68,4.73,4.11
83
+ FAUJ880105,1.52,1.52,1.52,1.52,1.52,1.52,1.52,1.00,1.52,1.90,1.52,1.52,1.52,1.52,1.52,1.52,1.73,1.52,1.52,1.90
84
+ FAUJ880106,2.04,6.24,4.37,3.78,3.41,3.53,3.31,1.00,5.66,3.49,4.45,4.87,4.80,6.02,4.31,2.70,3.17,5.90,6.72,3.17
85
+ FAUJ880107,7.3,11.1,8.0,9.2,14.4,10.6,11.4,0.0,10.2,16.1,10.1,10.9,10.4,13.9,17.8,13.1,16.7,13.2,13.9,17.2
86
+ FAUJ880108,-0.01,0.04,0.06,0.15,0.12,0.05,0.07,0.00,0.08,-0.01,-0.01,0.00,0.04,0.03,0.,0.11,0.04,0.00,0.03,0.01
87
+ FAUJ880109,0.,4.,2.,1.,0.,2.,1.,0.,1.,0.,0.,2.,0.,0.,0.,1.,1.,1.,1.,0.
88
+ FAUJ880110,0.,3.,3.,4.,0.,3.,4.,0.,1.,0.,0.,1.,0.,0.,0.,2.,2.,0.,2.,0.
89
+ FAUJ880111,0.,1.,0.,0.,0.,0.,0.,0.,1.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,0.
90
+ FAUJ880112,0.,0.,0.,1.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.
91
+ FAUJ880113,4.76,4.30,3.64,5.69,3.67,4.54,5.48,3.77,2.84,4.81,4.79,4.27,4.25,4.31,0.,3.83,3.87,4.75,4.30,4.86
92
+ FINA770101,1.08,1.05,0.85,0.85,0.95,0.95,1.15,0.55,1.00,1.05,1.25,1.15,1.15,1.10,0.71,0.75,0.75,1.10,1.10,0.95
93
+ FINA910101,1.,0.70,1.70,3.20,1.,1.,1.70,1.,1.,0.60,1.,0.70,1.,1.,1.,1.70,1.70,1.,1.,0.60
94
+ FINA910102,1.,0.70,1.,1.70,1.,1.,1.70,1.30,1.,1.,1.,0.70,1.,1.,13.,1.,1.,1.,1.,1.
95
+ FINA910103,1.20,1.70,1.20,0.70,1.,1.,0.70,0.80,1.20,0.80,1.,1.70,1.,1.,1.,1.50,1.,1.,1.,0.80
96
+ FINA910104,1.,1.70,1.,0.70,1.,1.,0.70,1.50,1.,1.,1.,1.70,1.,1.,0.10,1.,1.,1.,1.,1.
97
+ GARJ730101,0.28,0.10,0.25,0.21,0.28,0.35,0.33,0.17,0.21,0.82,1.00,0.09,0.74,2.18,0.39,0.12,0.21,5.70,1.26,0.60
98
+ GEIM800101,1.29,1.,0.81,1.10,0.79,1.07,1.49,0.63,1.33,1.05,1.31,1.33,1.54,1.13,0.63,0.78,0.77,1.18,0.71,0.81
99
+ GEIM800102,1.13,1.09,1.06,0.94,1.32,0.93,1.20,0.83,1.09,1.05,1.13,1.08,1.23,1.01,0.82,1.01,1.17,1.32,0.88,1.13
100
+ GEIM800103,1.55,0.20,1.20,1.55,1.44,1.13,1.67,0.59,1.21,1.27,1.25,1.20,1.37,0.40,0.21,1.01,0.55,1.86,1.08,0.64
101
+ GEIM800104,1.19,1.,0.94,1.07,0.95,1.32,1.64,0.60,1.03,1.12,1.18,1.27,1.49,1.02,0.68,0.81,0.85,1.18,0.77,0.74
102
+ GEIM800105,0.84,1.04,0.66,0.59,1.27,1.02,0.57,0.94,0.81,1.29,1.10,0.86,0.88,1.15,0.80,1.05,1.20,1.15,1.39,1.56
103
+ GEIM800106,0.86,1.15,0.60,0.66,0.91,1.11,0.37,0.86,1.07,1.17,1.28,1.01,1.15,1.34,0.61,0.91,1.14,1.13,1.37,1.31
104
+ GEIM800107,0.91,0.99,0.72,0.74,1.12,0.90,0.41,0.91,1.01,1.29,1.23,0.86,0.96,1.26,0.65,0.93,1.05,1.15,1.21,1.58
105
+ GEIM800108,0.91,1.,1.64,1.40,0.93,0.94,0.97,1.51,0.90,0.65,0.59,0.82,0.58,0.72,1.66,1.23,1.04,0.67,0.92,0.60
106
+ GEIM800109,0.80,0.96,1.10,1.60,0.,1.60,0.40,2.,0.96,0.85,0.80,0.94,0.39,1.20,2.10,1.30,0.60,0.,1.80,0.80
107
+ GEIM800110,1.10,0.93,1.57,1.41,1.05,0.81,1.40,1.30,0.85,0.67,0.52,0.94,0.69,0.60,1.77,1.13,0.88,0.62,0.41,0.58
108
+ GEIM800111,0.93,1.01,1.36,1.22,0.92,0.83,1.05,1.45,0.96,0.58,0.59,0.91,0.60,0.71,1.67,1.25,1.08,0.68,0.98,0.62
109
+ GOLD730101,0.75,0.75,0.69,0.00,1.00,0.59,0.00,0.00,0.00,2.95,2.40,1.50,1.30,2.65,2.60,0.00,0.45,3.00,2.85,1.70
110
+ GOLD730102,88.3,181.2,125.1,110.8,112.4,148.7,140.5,60.0,152.6,168.5,168.5,175.6,162.2,189.0,122.2,88.7,118.2,227.0,193.0,141.4
111
+ GRAR740101,0.00,0.65,1.33,1.38,2.75,0.89,0.92,0.74,0.58,0.00,0.00,0.33,0.00,0.00,0.39,1.42,0.71,0.13,0.20,0.00
112
+ GRAR740102,8.1,10.5,11.6,13.0,5.5,10.5,12.3,9.0,10.4,5.2,4.9,11.3,5.7,5.2,8.0,9.2,8.6,5.4,6.2,5.9
113
+ GRAR740103,31.,124.,56.,54.,55.,85.,83.,3.,96.,111.,111.,119.,105.,132.,32.5,32.,61.,170.,136.,84.
114
+ GUYH850101,0.10,1.91,0.48,0.78,-1.42,0.95,0.83,0.33,-0.50,-1.13,-1.18,1.40,-1.59,-2.12,0.73,0.52,0.07,-0.51,-0.21,-1.27
115
+ HOPA770101,1.0,2.3,2.2,6.5,0.1,2.1,6.2,1.1,2.8,0.8,0.8,5.3,0.7,1.4,0.9,1.7,1.5,1.9,2.1,0.9
116
+ HOPT810101,-0.5,3.0,0.2,3.0,-1.0,0.2,3.0,0.0,-0.5,-1.8,-1.8,3.0,-1.3,-2.5,0.0,0.3,-0.4,-3.4,-2.3,-1.5
117
+ HUTJ700101,29.22,26.37,38.30,37.09,50.70,44.02,41.84,23.71,59.64,45.00,48.03,57.10,69.32,48.52,36.13,32.40,35.20,56.92,51.73,40.35
118
+ HUTJ700102,30.88,68.43,41.70,40.66,53.83,46.62,44.98,24.74,65.99,49.71,50.62,63.21,55.32,51.06,39.21,35.65,36.50,60.00,51.15,42.75
119
+ HUTJ700103,154.33,341.01,207.90,194.91,219.79,235.51,223.16,127.90,242.54,233.21,232.30,300.46,202.65,204.74,179.93,174.06,205.80,237.01,229.15,207.60
120
+ ISOY800101,1.53,1.17,0.60,1.00,0.89,1.27,1.63,0.44,1.03,1.07,1.32,1.26,1.66,1.22,0.25,0.65,0.86,1.05,0.70,0.93
121
+ ISOY800102,0.86,0.98,0.74,0.69,1.39,0.89,0.66,0.70,1.06,1.31,1.01,0.77,1.06,1.16,1.16,1.09,1.24,1.17,1.28,1.40
122
+ ISOY800103,0.78,1.06,1.56,1.50,0.60,0.78,0.97,1.73,0.83,0.40,0.57,1.01,0.30,0.67,1.55,1.19,1.09,0.74,1.14,0.44
123
+ ISOY800104,1.09,0.97,1.14,0.77,0.50,0.83,0.92,1.25,0.67,0.66,0.44,1.25,0.45,0.50,2.96,1.21,1.33,0.62,0.94,0.56
124
+ ISOY800105,0.35,0.75,2.12,2.16,0.50,0.73,0.65,2.40,1.19,0.12,0.58,0.83,0.22,0.89,0.43,1.24,0.85,0.62,1.44,0.43
125
+ ISOY800106,1.09,1.07,0.88,1.24,1.04,1.09,1.14,0.27,1.07,0.97,1.30,1.20,0.55,0.80,1.78,1.20,0.99,1.03,0.69,0.77
126
+ ISOY800107,1.34,2.78,0.92,1.77,1.44,0.79,2.54,0.95,0.00,0.52,1.05,0.79,0.00,0.43,0.37,0.87,1.14,1.79,0.73,0.00
127
+ ISOY800108,0.47,0.52,2.16,1.15,0.41,0.95,0.64,3.03,0.89,0.62,0.53,0.98,0.68,0.61,0.63,1.03,0.39,0.63,0.83,0.76
128
+ JANJ780101,27.8,94.7,60.1,60.6,15.5,68.7,68.2,24.5,50.7,22.8,27.6,103.0,33.5,25.5,51.5,42.0,45.0,34.7,55.2,23.7
129
+ JANJ780102,51.,5.,22.,19.,74.,16.,16.,52.,34.,66.,60.,3.,52.,58.,25.,35.,30.,49.,24.,64.
130
+ JANJ780103,15.,67.,49.,50.,5.,56.,55.,10.,34.,13.,16.,85.,20.,10.,45.,32.,32.,17.,41.,14.
131
+ JANJ790101,1.7,0.1,0.4,0.4,4.6,0.3,0.3,1.8,0.8,3.1,2.4,0.05,1.9,2.2,0.6,0.8,0.7,1.6,0.5,2.9
132
+ JANJ790102,0.3,-1.4,-0.5,-0.6,0.9,-0.7,-0.7,0.3,-0.1,0.7,0.5,-1.8,0.4,0.5,-0.3,-0.1,-0.2,0.3,-0.4,0.6
133
+ JOND750101,0.87,0.85,0.09,0.66,1.52,0.00,0.67,0.10,0.87,3.15,2.17,1.64,1.67,2.87,2.77,0.07,0.07,3.77,2.67,1.87
134
+ JOND750102,2.34,1.18,2.02,2.01,1.65,2.17,2.19,2.34,1.82,2.36,2.36,2.18,2.28,1.83,1.99,2.21,2.10,2.38,2.20,2.32
135
+ JOND920101,0.077,0.051,0.043,0.052,0.020,0.041,0.062,0.074,0.023,0.053,0.091,0.059,0.024,0.040,0.051,0.069,0.059,0.014,0.032,0.066
136
+ JOND920102,100.,83.,104.,86.,44.,84.,77.,50.,91.,103.,54.,72.,93.,51.,58.,117.,107.,25.,50.,98.
137
+ JUKT750101,5.3,2.6,3.0,3.6,1.3,2.4,3.3,4.8,1.4,3.1,4.7,4.1,1.1,2.3,2.5,4.5,3.7,0.8,2.3,4.2
138
+ JUNJ780101,685.,382.,397.,400.,241.,313.,427.,707.,155.,394.,581.,575.,132.,303.,366.,593.,490.,99.,292.,553.
139
+ KANM800101,1.36,1.00,0.89,1.04,0.82,1.14,1.48,0.63,1.11,1.08,1.21,1.22,1.45,1.05,0.52,0.74,0.81,0.97,0.79,0.94
140
+ KANM800102,0.81,0.85,0.62,0.71,1.17,0.98,0.53,0.88,0.92,1.48,1.24,0.77,1.05,1.20,0.61,0.92,1.18,1.18,1.23,1.66
141
+ KANM800103,1.45,1.15,0.64,0.91,0.70,1.14,1.29,0.53,1.13,1.23,1.56,1.27,1.83,1.20,0.21,0.48,0.77,1.17,0.74,1.10
142
+ KANM800104,0.75,0.79,0.33,0.31,1.46,0.75,0.46,0.83,0.83,1.87,1.56,0.66,0.86,1.37,0.52,0.82,1.36,0.79,1.08,2.00
143
+ KARP850101,1.041,1.038,1.117,1.033,0.960,1.165,1.094,1.142,0.982,1.002,0.967,1.093,0.947,0.930,1.055,1.169,1.073,0.925,0.961,0.982
144
+ KARP850102,0.946,1.028,1.006,1.089,0.878,1.025,1.036,1.042,0.952,0.892,0.961,1.082,0.862,0.912,1.085,1.048,1.051,0.917,0.930,0.927
145
+ KARP850103,0.892,0.901,0.930,0.932,0.925,0.885,0.933,0.923,0.894,0.872,0.921,1.057,0.804,0.914,0.932,0.923,0.934,0.803,0.837,0.913
146
+ KHAG800101,49.1,133.,-3.6,0.,0.,20.,0.,64.6,75.7,18.9,15.6,0.,6.8,54.7,43.8,44.4,31.0,70.5,0.,29.5
147
+ KLEP840101,0.,1.,0.,-1.,0.,0.,-1.,0.,0.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,0.
148
+ KRIW710101,4.60,6.50,5.90,5.70,-1.00,6.10,5.60,7.60,4.50,2.60,3.25,7.90,1.40,3.20,7.00,5.25,4.80,4.00,4.35,3.40
149
+ KRIW790101,4.32,6.55,6.24,6.04,1.73,6.13,6.17,6.09,5.66,2.31,3.93,7.92,2.44,2.59,7.19,5.37,5.16,2.78,3.58,3.31
150
+ KRIW790102,0.28,0.34,0.31,0.33,0.11,0.39,0.37,0.28,0.23,0.12,0.16,0.59,0.08,0.10,0.46,0.27,0.26,0.15,0.25,0.22
151
+ KRIW790103,27.5,105.0,58.7,40.0,44.6,80.7,62.0,0.0,79.0,93.5,93.5,100.0,94.1,115.5,41.9,29.3,51.3,145.5,117.3,71.5
152
+ KYTJ820101,1.8,-4.5,-3.5,-3.5,2.5,-3.5,-3.5,-0.4,-3.2,4.5,3.8,-3.9,1.9,2.8,-1.6,-0.8,-0.7,-0.9,-1.3,4.2
153
+ LAWE840101,-0.48,-0.06,-0.87,-0.75,-0.32,-0.32,-0.71,0.00,-0.51,0.81,1.02,-0.09,0.81,1.03,2.03,0.05,-0.35,0.66,1.24,0.56
154
+ LEVM760101,-0.5,3.0,0.2,2.5,-1.0,0.2,2.5,0.0,-0.5,-1.8,-1.8,3.0,-1.3,-2.5,-1.4,0.3,-0.4,-3.4,-2.3,-1.5
155
+ LEVM760102,0.77,3.72,1.98,1.99,1.38,2.58,2.63,0.00,2.76,1.83,2.08,2.94,2.34,2.97,1.42,1.28,1.43,3.58,3.36,1.49
156
+ LEVM760103,121.9,121.4,117.5,121.2,113.7,118.0,118.2,0.,118.2,118.9,118.1,122.0,113.1,118.2,81.9,117.9,117.1,118.4,110.0,121.7
157
+ LEVM760104,243.2,206.6,207.1,215.0,209.4,205.4,213.6,300.0,219.9,217.9,205.6,210.9,204.0,203.7,237.4,232.0,226.7,203.7,195.6,220.3
158
+ LEVM760105,0.77,2.38,1.45,1.43,1.22,1.75,1.77,0.58,1.78,1.56,1.54,2.08,1.80,1.90,1.25,1.08,1.24,2.21,2.13,1.29
159
+ LEVM760106,5.2,6.0,5.0,5.0,6.1,6.0,6.0,4.2,6.0,7.0,7.0,6.0,6.8,7.1,6.2,4.9,5.0,7.6,7.1,6.4
160
+ LEVM760107,0.025,0.20,0.10,0.10,0.10,0.10,0.10,0.025,0.10,0.19,0.19,0.20,0.19,0.39,0.17,0.025,0.10,0.56,0.39,0.15
161
+ LEVM780101,1.29,0.96,0.90,1.04,1.11,1.27,1.44,0.56,1.22,0.97,1.30,1.23,1.47,1.07,0.52,0.82,0.82,0.99,0.72,0.91
162
+ LEVM780102,0.90,0.99,0.76,0.72,0.74,0.80,0.75,0.92,1.08,1.45,1.02,0.77,0.97,1.32,0.64,0.95,1.21,1.14,1.25,1.49
163
+ LEVM780103,0.77,0.88,1.28,1.41,0.81,0.98,0.99,1.64,0.68,0.51,0.58,0.96,0.41,0.59,1.91,1.32,1.04,0.76,1.05,0.47
164
+ LEVM780104,1.32,0.98,0.95,1.03,0.92,1.10,1.44,0.61,1.31,0.93,1.31,1.25,1.39,1.02,0.58,0.76,0.79,0.97,0.73,0.93
165
+ LEVM780105,0.86,0.97,0.73,0.69,1.04,1.00,0.66,0.89,0.85,1.47,1.04,0.77,0.93,1.21,0.68,1.02,1.27,1.26,1.31,1.43
166
+ LEVM780106,0.79,0.90,1.25,1.47,0.79,0.92,1.02,1.67,0.81,0.50,0.57,0.99,0.51,0.77,1.78,1.30,0.97,0.79,0.93,0.46
167
+ LEWP710101,0.22,0.28,0.42,0.73,0.20,0.26,0.08,0.58,0.14,0.22,0.19,0.27,0.38,0.08,0.46,0.55,0.49,0.43,0.46,0.08
168
+ LIFS790101,0.92,0.93,0.60,0.48,1.16,0.95,0.61,0.61,0.93,1.81,1.30,0.70,1.19,1.25,0.40,0.82,1.12,1.54,1.53,1.81
169
+ LIFS790102,1.00,0.68,0.54,0.50,0.91,0.28,0.59,0.79,0.38,2.60,1.42,0.59,1.49,1.30,0.35,0.70,0.59,0.89,1.08,2.63
170
+ LIFS790103,0.90,1.02,0.62,0.47,1.24,1.18,0.62,0.56,1.12,1.54,1.26,0.74,1.09,1.23,0.42,0.87,1.30,1.75,1.68,1.53
171
+ MANP780101,12.97,11.72,11.42,10.85,14.63,11.76,11.89,12.43,12.16,15.67,14.90,11.36,14.39,14.00,11.37,11.23,11.69,13.93,13.42,15.71
172
+ MAXF760101,1.43,1.18,0.64,0.92,0.94,1.22,1.67,0.46,0.98,1.04,1.36,1.27,1.53,1.19,0.49,0.70,0.78,1.01,0.69,0.98
173
+ MAXF760102,0.86,0.94,0.74,0.72,1.17,0.89,0.62,0.97,1.06,1.24,0.98,0.79,1.08,1.16,1.22,1.04,1.18,1.07,1.25,1.33
174
+ MAXF760103,0.64,0.62,3.14,1.92,0.32,0.80,1.01,0.63,2.05,0.92,0.37,0.89,1.07,0.86,0.50,1.01,0.92,1.00,1.31,0.87
175
+ MAXF760104,0.17,0.76,2.62,1.08,0.95,0.91,0.28,5.02,0.57,0.26,0.21,1.17,0.00,0.28,0.12,0.57,0.23,0.00,0.97,0.24
176
+ MAXF760105,1.13,0.48,1.11,1.18,0.38,0.41,1.02,3.84,0.30,0.40,0.65,1.13,0.00,0.45,0.00,0.81,0.71,0.93,0.38,0.48
177
+ MAXF760106,1.00,1.18,0.87,1.39,1.09,1.13,1.04,0.46,0.71,0.68,1.01,1.05,0.36,0.65,1.95,1.56,1.23,1.10,0.87,0.58
178
+ MCMT640101,4.34,26.66,13.28,12.00,35.77,17.56,17.26,0.00,21.81,19.06,18.78,21.29,21.64,29.40,10.93,6.35,11.01,42.53,31.53,13.92
179
+ MEEJ800101,0.5,0.8,0.8,-8.2,-6.8,-4.8,-16.9,0.0,-3.5,13.9,8.8,0.1,4.8,13.2,6.1,1.2,2.7,14.9,6.1,2.7
180
+ MEEJ800102,-0.1,-4.5,-1.6,-2.8,-2.2,-2.5,-7.5,-0.5,0.8,11.8,10.0,-3.2,7.1,13.9,8.0,-3.7,1.5,18.1,8.2,3.3
181
+ MEEJ810101,1.1,-0.4,-4.2,-1.6,7.1,-2.9,0.7,-0.2,-0.7,8.5,11.0,-1.9,5.4,13.4,4.4,-3.2,-1.7,17.1,7.4,5.9
182
+ MEEJ810102,1.0,-2.0,-3.0,-0.5,4.6,-2.0,1.1,0.2,-2.2,7.0,9.6,-3.0,4.0,12.6,3.1,-2.9,-0.6,15.1,6.7,4.6
183
+ MEIH800101,0.93,0.98,0.98,1.01,0.88,1.02,1.02,1.01,0.89,0.79,0.85,1.05,0.84,0.78,1.00,1.02,0.99,0.83,0.93,0.81
184
+ MEIH800102,0.94,1.09,1.04,1.08,0.84,1.11,1.12,1.01,0.92,0.76,0.82,1.23,0.83,0.73,1.04,1.04,1.02,0.87,1.03,0.81
185
+ MEIH800103,87.,81.,70.,71.,104.,66.,72.,90.,90.,105.,104.,65.,100.,108.,78.,83.,83.,94.,83.,94.
186
+ MIYS850101,2.36,1.92,1.70,1.67,3.36,1.75,1.74,2.06,2.41,4.17,3.93,1.23,4.22,4.37,1.89,1.81,2.04,3.82,2.91,3.49
187
+ NAGK730101,1.29,0.83,0.77,1.00,0.94,1.10,1.54,0.72,1.29,0.94,1.23,1.23,1.23,1.23,0.70,0.78,0.87,1.06,0.63,0.97
188
+ NAGK730102,0.96,0.67,0.72,0.90,1.13,1.18,0.33,0.90,0.87,1.54,1.26,0.81,1.29,1.37,0.75,0.77,1.23,1.13,1.07,1.41
189
+ NAGK730103,0.72,1.33,1.38,1.04,1.01,0.81,0.75,1.35,0.76,0.80,0.63,0.84,0.62,0.58,1.43,1.34,1.03,0.87,1.35,0.83
190
+ NAKH900101,7.99,5.86,4.33,5.14,1.81,3.98,6.10,6.91,2.17,5.48,9.16,6.01,2.50,3.83,4.95,6.84,5.77,1.34,3.15,6.65
191
+ NAKH900102,3.73,3.34,2.33,2.23,2.30,2.36,3.,3.36,1.55,2.52,3.40,3.36,1.37,1.94,3.18,2.83,2.63,1.15,1.76,2.53
192
+ NAKH900103,5.74,1.92,5.25,2.11,1.03,2.30,2.63,5.66,2.30,9.12,15.36,3.20,5.30,6.51,4.79,7.55,7.51,2.51,4.08,5.12
193
+ NAKH900104,-0.60,-1.18,0.39,-1.36,-0.34,-0.71,-1.16,-0.37,0.08,1.44,1.82,-0.84,2.04,1.38,-0.05,0.25,0.66,1.02,0.53,-0.60
194
+ NAKH900105,5.88,1.54,4.38,1.70,1.11,2.30,2.60,5.29,2.33,8.78,16.52,2.58,6.00,6.58,5.29,7.68,8.38,2.89,3.51,4.66
195
+ NAKH900106,-0.57,-1.29,0.02,-1.54,-0.30,-0.71,-1.17,-0.48,0.10,1.31,2.16,-1.02,2.55,1.42,0.11,0.30,0.99,1.35,0.20,-0.79
196
+ NAKH900107,5.39,2.81,7.31,3.07,0.86,2.31,2.70,6.52,2.23,9.94,12.64,4.67,3.68,6.34,3.62,7.24,5.44,1.64,5.42,6.18
197
+ NAKH900108,-0.70,-0.91,1.28,-0.93,-0.41,-0.71,-1.13,-0.12,0.04,1.77,1.02,-0.40,0.86,1.29,-0.42,0.14,-0.13,0.26,1.29,-0.19
198
+ NAKH900109,9.25,3.96,3.71,3.89,1.07,3.17,4.80,8.51,1.88,6.47,10.94,3.50,3.14,6.36,4.36,6.26,5.66,2.22,3.28,7.55
199
+ NAKH900110,0.34,-0.57,-0.27,-0.56,-0.32,-0.34,-0.43,0.48,-0.19,0.39,0.52,-0.75,0.47,1.30,-0.19,-0.20,-0.04,0.77,0.07,0.36
200
+ NAKH900111,10.17,1.21,1.36,1.18,1.48,1.57,1.15,8.87,1.07,10.91,16.22,1.04,4.12,9.60,2.24,5.38,5.61,2.67,2.68,11.44
201
+ NAKH900112,6.61,0.41,1.84,0.59,0.83,1.20,1.63,4.88,1.14,12.91,21.66,1.15,7.17,7.76,3.51,6.84,8.89,2.11,2.57,6.30
202
+ NAKH900113,1.61,0.40,0.73,0.75,0.37,0.61,1.50,3.12,0.46,1.61,1.37,0.62,1.59,1.24,0.67,0.68,0.92,1.63,0.67,1.30
203
+ NAKH920101,8.63,6.75,4.18,6.24,1.03,4.76,7.82,6.80,2.70,3.48,8.44,6.25,2.14,2.73,6.28,8.53,4.43,0.80,2.54,5.44
204
+ NAKH920102,10.88,6.01,5.75,6.13,0.69,4.68,9.34,7.72,2.15,1.80,8.03,6.11,3.79,2.93,7.21,7.25,3.51,0.47,1.01,4.57
205
+ NAKH920103,5.15,4.38,4.81,5.75,3.24,4.45,7.05,6.38,2.69,4.40,8.11,5.25,1.60,3.52,5.65,8.04,7.41,1.68,3.42,7.00
206
+ NAKH920104,5.04,3.73,5.94,5.26,2.20,4.50,6.07,7.09,2.99,4.32,9.88,6.31,1.85,3.72,6.22,8.05,5.20,2.10,3.32,6.19
207
+ NAKH920105,9.90,0.09,0.94,0.35,2.55,0.87,0.08,8.14,0.20,15.25,22.28,0.16,1.85,6.47,2.38,4.17,4.33,2.21,3.42,14.34
208
+ NAKH920106,6.69,6.65,4.49,4.97,1.70,5.39,7.76,6.32,2.11,4.51,8.23,8.36,2.46,3.59,5.20,7.40,5.18,1.06,2.75,5.27
209
+ NAKH920107,5.08,4.75,5.75,5.96,2.95,4.24,6.04,8.20,2.10,4.95,8.03,4.93,2.61,4.36,4.84,6.41,5.87,2.31,4.55,6.07
210
+ NAKH920108,9.36,0.27,2.31,0.94,2.56,1.14,0.94,6.17,0.47,13.73,16.64,0.58,3.93,10.99,1.96,5.58,4.68,2.20,3.13,12.43
211
+ NISK800101,0.23,-0.26,-0.94,-1.13,1.78,-0.57,-0.75,-0.07,0.11,1.19,1.03,-1.05,0.66,0.48,-0.76,-0.67,-0.36,0.90,0.59,1.24
212
+ NISK860101,-0.22,-0.93,-2.65,-4.12,4.66,-2.76,-3.64,-1.62,1.28,5.58,5.01,-4.18,3.51,5.27,-3.03,-2.84,-1.20,5.20,2.15,4.45
213
+ NOZY710101,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1.8,1.8,0.0,1.3,2.5,0.0,0.0,0.4,3.4,2.3,1.5
214
+ OOBM770101,-1.895,-1.475,-1.560,-1.518,-2.035,-1.521,-1.535,-1.898,-1.755,-1.951,-1.966,-1.374,-1.963,-1.864,-1.699,-1.753,-1.767,-1.869,-1.686,-1.981
215
+ OOBM770102,-1.404,-0.921,-1.178,-1.162,-1.365,-1.116,-1.163,-1.364,-1.215,-1.189,-1.315,-1.074,-1.303,-1.135,-1.236,-1.297,-1.252,-1.030,-1.030,-1.254
216
+ OOBM770103,-0.491,-0.554,-0.382,-0.356,-0.670,-0.405,-0.371,-0.534,-0.540,-0.762,-0.650,-0.300,-0.659,-0.729,-0.463,-0.455,-0.515,-0.839,-0.656,-0.728
217
+ OOBM770104,-9.475,-16.225,-12.480,-12.144,-12.210,-13.689,-13.815,-7.592,-17.550,-15.608,-15.728,-12.366,-15.704,-20.504,-11.893,-10.518,-12.369,-26.166,-20.232,-13.867
218
+ OOBM770105,-7.020,-10.131,-9.424,-9.296,-8.190,-10.044,-10.467,-5.456,-12.150,-9.512,-10.520,-9.666,-10.424,-12.485,-8.652,-7.782,-8.764,-14.420,-12.360,-8.778
219
+ OOBM850101,2.01,0.84,0.03,-2.05,1.98,1.02,0.93,0.12,-0.14,3.70,2.73,2.55,1.75,2.68,0.41,1.47,2.39,2.49,2.23,3.50
220
+ OOBM850102,1.34,0.95,2.49,3.32,1.07,1.49,2.20,2.07,1.27,0.66,0.54,0.61,0.70,0.80,2.12,0.94,1.09,-4.65,-0.17,1.32
221
+ OOBM850103,0.46,-1.54,1.31,-0.33,0.20,-1.12,0.48,0.64,-1.31,3.28,0.43,-1.71,0.15,0.52,-0.58,-0.83,-1.52,1.25,-2.21,0.54
222
+ OOBM850104,-2.49,2.55,2.27,8.86,-3.13,1.79,4.04,-0.56,4.22,-10.87,-7.16,-9.97,-4.96,-6.64,5.19,-1.60,-4.75,-17.84,9.25,-3.97
223
+ OOBM850105,4.55,5.97,5.56,2.85,-0.78,4.15,5.16,9.14,4.48,2.10,3.24,10.68,2.18,4.37,5.14,6.78,8.60,1.97,2.40,3.81
224
+ PALJ810101,1.30,0.93,0.90,1.02,0.92,1.04,1.43,0.63,1.33,0.87,1.30,1.23,1.32,1.09,0.63,0.78,0.80,1.03,0.71,0.95
225
+ PALJ810102,1.32,1.04,0.74,0.97,0.70,1.25,1.48,0.59,1.06,1.01,1.22,1.13,1.47,1.10,0.57,0.77,0.86,1.02,0.72,1.05
226
+ PALJ810103,0.81,1.03,0.81,0.71,1.12,1.03,0.59,0.94,0.85,1.47,1.03,0.77,0.96,1.13,0.75,1.02,1.19,1.24,1.35,1.44
227
+ PALJ810104,0.90,0.75,0.82,0.75,1.12,0.95,0.44,0.83,0.86,1.59,1.24,0.75,0.94,1.41,0.46,0.70,1.20,1.28,1.45,1.73
228
+ PALJ810105,0.84,0.91,1.48,1.28,0.69,1.,0.78,1.76,0.53,0.55,0.49,0.95,0.52,0.88,1.47,1.29,1.05,0.88,1.28,0.51
229
+ PALJ810106,0.65,0.93,1.45,1.47,1.43,0.94,0.75,1.53,0.96,0.57,0.56,0.95,0.71,0.72,1.51,1.46,0.96,0.90,1.12,0.55
230
+ PALJ810107,1.08,0.93,1.05,0.86,1.22,0.95,1.09,0.85,1.02,0.98,1.04,1.01,1.11,0.96,0.91,0.95,1.15,1.17,0.80,1.03
231
+ PALJ810108,1.34,0.91,0.83,1.06,1.27,1.13,1.69,0.47,1.11,0.84,1.39,1.08,0.90,1.02,0.48,1.05,0.74,0.64,0.73,1.18
232
+ PALJ810109,1.15,1.06,0.87,1.,1.03,1.43,1.37,0.64,0.95,0.99,1.22,1.20,1.45,0.92,0.72,0.84,0.97,1.11,0.72,0.82
233
+ PALJ810110,0.89,1.06,0.67,0.71,1.04,1.06,0.72,0.87,1.04,1.14,1.02,1.,1.41,1.32,0.69,0.86,1.15,1.06,1.35,1.66
234
+ PALJ810111,0.82,0.99,1.27,0.98,0.71,1.01,0.54,0.94,1.26,1.67,0.94,0.73,1.30,1.56,0.69,0.65,0.98,1.25,1.26,1.22
235
+ PALJ810112,0.98,1.03,0.66,0.74,1.01,0.63,0.59,0.90,1.17,1.38,1.05,0.83,0.82,1.23,0.73,0.98,1.20,1.26,1.23,1.62
236
+ PALJ810113,0.69,0.,1.52,2.42,0.,1.44,0.63,2.64,0.22,0.43,0.,1.18,0.88,2.20,1.34,1.43,0.28,0.,1.53,0.14
237
+ PALJ810114,0.87,1.30,1.36,1.24,0.83,1.06,0.91,1.69,0.91,0.27,0.67,0.66,0.,0.47,1.54,1.08,1.12,1.24,0.54,0.69
238
+ PALJ810115,0.91,0.77,1.32,0.90,0.50,1.06,0.53,1.61,1.08,0.36,0.77,1.27,0.76,0.37,1.62,1.34,0.87,1.10,1.24,0.52
239
+ PALJ810116,0.92,0.90,1.57,1.22,0.62,0.66,0.92,1.61,0.39,0.79,0.50,0.86,0.50,0.96,1.30,1.40,1.11,0.57,1.78,0.50
240
+ PARJ860101,2.1,4.2,7.0,10.0,1.4,6.0,7.8,5.7,2.1,-8.0,-9.2,5.7,-4.2,-9.2,2.1,6.5,5.2,-10.0,-1.9,-3.7
241
+ PLIV810101,-2.89,-3.30,-3.41,-3.38,-2.49,-3.15,-2.94,-3.25,-2.84,-1.72,-1.61,-3.31,-1.84,-1.63,-2.50,-3.30,-2.91,-1.75,-2.42,-2.08
242
+ PONP800101,12.28,11.49,11.00,10.97,14.93,11.28,11.19,12.01,12.84,14.77,14.10,10.80,14.33,13.43,11.19,11.26,11.65,12.95,13.29,15.07
243
+ PONP800102,7.62,6.81,6.17,6.18,10.93,6.67,6.38,7.31,7.85,9.99,9.37,5.72,9.83,8.99,6.64,6.93,7.08,8.41,8.53,10.38
244
+ PONP800103,2.63,2.45,2.27,2.29,3.36,2.45,2.31,2.55,2.57,3.08,2.98,2.12,3.18,3.02,2.46,2.60,2.55,2.85,2.79,3.21
245
+ PONP800104,13.65,11.28,12.24,10.98,14.49,11.30,12.55,15.36,11.59,14.63,14.01,11.96,13.40,14.08,11.51,11.26,13.00,12.06,12.64,12.88
246
+ PONP800105,14.60,13.24,11.79,13.78,15.90,12.02,13.59,14.18,15.35,14.10,16.49,13.28,16.23,14.18,14.10,13.36,14.50,13.90,14.76,16.30
247
+ PONP800106,10.67,11.05,10.85,10.21,14.15,11.71,11.71,10.95,12.07,12.95,13.07,9.93,15.00,13.27,10.62,11.18,10.53,11.41,11.52,13.86
248
+ PONP800107,3.70,2.53,2.12,2.60,3.03,2.70,3.30,3.13,3.57,7.69,5.88,1.79,5.21,6.60,2.12,2.43,2.60,6.25,3.03,7.14
249
+ PONP800108,6.05,5.70,5.04,4.95,7.86,5.45,5.10,6.16,5.80,7.51,7.37,4.88,6.39,6.62,5.65,5.53,5.81,6.98,6.73,7.62
250
+ PRAM820101,0.305,0.227,0.322,0.335,0.339,0.306,0.282,0.352,0.215,0.278,0.262,0.391,0.280,0.195,0.346,0.326,0.251,0.291,0.293,0.291
251
+ PRAM820102,0.175,0.083,0.090,0.140,0.074,0.093,0.135,0.201,0.125,0.100,0.104,0.058,0.054,0.104,0.136,0.155,0.152,0.092,0.081,0.096
252
+ PRAM820103,0.687,0.590,0.489,0.632,0.263,0.527,0.669,0.670,0.594,0.564,0.541,0.407,0.328,0.577,0.600,0.692,0.713,0.632,0.495,0.529
253
+ PRAM900101,-6.70,51.50,20.10,38.50,-8.40,17.20,34.30,-4.20,12.60,-13.,-11.70,36.80,-14.20,-15.50,0.80,-2.50,-5.,-7.90,2.90,-10.90
254
+ PRAM900102,1.29,0.96,0.90,1.04,1.11,1.27,1.44,0.56,1.22,0.97,1.30,1.23,1.47,1.07,0.52,0.82,0.82,0.99,0.72,0.91
255
+ PRAM900103,0.90,0.99,0.76,0.72,0.74,0.80,0.75,0.92,1.08,1.45,1.02,0.77,0.97,1.32,0.64,0.95,1.21,1.14,1.25,1.49
256
+ PRAM900104,0.78,0.88,1.28,1.41,0.80,0.97,1.,1.64,0.69,0.51,0.59,0.96,0.39,0.58,1.91,1.33,1.03,0.75,1.05,0.47
257
+ PTIO830101,1.10,0.95,0.80,0.65,0.95,1.00,1.00,0.60,0.85,1.10,1.25,1.00,1.15,1.10,0.10,0.75,0.75,1.10,1.10,0.95
258
+ PTIO830102,1.00,0.70,0.60,0.50,1.90,1.00,0.70,0.30,0.80,4.00,2.00,0.70,1.90,3.10,0.20,0.90,1.70,2.20,2.80,4.00
259
+ QIAN880101,0.12,0.04,-0.10,0.01,-0.25,-0.03,-0.02,-0.02,-0.06,-0.07,0.05,0.26,0.00,0.05,-0.19,-0.19,-0.04,-0.06,-0.14,-0.03
260
+ QIAN880102,0.26,-0.14,-0.03,0.15,-0.15,-0.13,0.21,-0.37,0.10,-0.03,-0.02,0.12,0.00,0.12,-0.08,0.01,-0.34,-0.01,-0.29,0.02
261
+ QIAN880103,0.64,-0.10,0.09,0.33,0.03,-0.23,0.51,-0.09,-0.23,-0.22,0.41,-0.17,0.13,-0.03,-0.43,-0.10,-0.07,-0.02,-0.38,-0.01
262
+ QIAN880104,0.29,-0.03,-0.04,0.11,-0.05,0.26,0.28,-0.67,-0.26,0.00,0.47,-0.19,0.27,0.24,-0.34,-0.17,-0.20,0.25,-0.30,-0.01
263
+ QIAN880105,0.68,-0.22,-0.09,-0.02,-0.15,-0.15,0.44,-0.73,-0.14,-0.08,0.61,0.03,0.39,0.06,-0.76,-0.26,-0.10,0.20,-0.04,0.12
264
+ QIAN880106,0.34,0.22,-0.33,0.06,-0.18,0.01,0.20,-0.88,-0.09,-0.03,0.20,-0.11,0.43,0.15,-0.81,-0.35,-0.37,0.07,-0.31,0.13
265
+ QIAN880107,0.57,0.23,-0.36,-0.46,-0.15,0.15,0.26,-0.71,-0.05,0.00,0.48,0.16,0.41,0.03,-1.12,-0.47,-0.54,-0.10,-0.35,0.31
266
+ QIAN880108,0.33,0.10,-0.19,-0.44,-0.03,0.19,0.21,-0.46,0.27,-0.33,0.57,0.23,0.79,0.48,-1.86,-0.23,-0.33,0.15,-0.19,0.24
267
+ QIAN880109,0.13,0.08,-0.07,-0.71,-0.09,0.12,0.13,-0.39,0.32,0.00,0.50,0.37,0.63,0.15,-1.40,-0.28,-0.21,0.02,-0.10,0.17
268
+ QIAN880110,0.31,0.18,-0.10,-0.81,-0.26,0.41,-0.06,-0.42,0.51,-0.15,0.56,0.47,0.58,0.10,-1.33,-0.49,-0.44,0.14,-0.08,-0.01
269
+ QIAN880111,0.21,0.07,-0.04,-0.58,-0.12,0.13,-0.23,-0.15,0.37,0.31,0.70,0.28,0.61,-0.06,-1.03,-0.28,-0.25,0.21,0.16,0.00
270
+ QIAN880112,0.18,0.21,-0.03,-0.32,-0.29,-0.27,-0.25,-0.40,0.28,-0.03,0.62,0.41,0.21,0.05,-0.84,-0.05,-0.16,0.32,0.11,0.06
271
+ QIAN880113,-0.08,0.05,-0.08,-0.24,-0.25,-0.28,-0.19,-0.10,0.29,-0.01,0.28,0.45,0.11,0.00,-0.42,0.07,-0.33,0.36,0.00,-0.13
272
+ QIAN880114,-0.18,-0.13,0.28,0.05,-0.26,0.21,-0.06,0.23,0.24,-0.42,-0.23,0.03,-0.42,-0.18,-0.13,0.41,0.33,-0.10,-0.10,-0.07
273
+ QIAN880115,-0.01,0.02,0.41,-0.09,-0.27,0.01,0.09,0.13,0.22,-0.27,-0.25,0.08,-0.57,-0.12,0.26,0.44,0.35,-0.15,0.15,-0.09
274
+ QIAN880116,-0.19,0.03,0.02,-0.06,-0.29,0.02,-0.10,0.19,-0.16,-0.08,-0.42,-0.09,-0.38,-0.32,0.05,0.25,0.22,-0.19,0.05,-0.15
275
+ QIAN880117,-0.14,0.14,-0.27,-0.10,-0.64,-0.11,-0.39,0.46,-0.04,0.16,-0.57,0.04,0.24,0.08,0.02,-0.12,0.00,-0.10,0.18,0.29
276
+ QIAN880118,-0.31,0.25,-0.53,-0.54,-0.06,0.07,-0.52,0.37,-0.32,0.57,0.09,-0.29,0.29,0.24,-0.31,0.11,0.03,0.15,0.29,0.48
277
+ QIAN880119,-0.10,0.19,-0.89,-0.89,0.13,-0.04,-0.34,-0.45,-0.34,0.95,0.32,-0.46,0.43,0.36,-0.91,-0.12,0.49,0.34,0.42,0.76
278
+ QIAN880120,-0.25,-0.02,-0.77,-1.01,0.13,-0.12,-0.62,-0.72,-0.16,1.10,0.23,-0.59,0.32,0.48,-1.24,-0.31,0.17,0.45,0.77,0.69
279
+ QIAN880121,-0.26,-0.09,-0.34,-0.55,0.47,-0.33,-0.75,-0.56,-0.04,0.94,0.25,-0.55,-0.05,0.20,-1.28,-0.28,0.08,0.22,0.53,0.67
280
+ QIAN880122,0.05,-0.11,-0.40,-0.11,0.36,-0.67,-0.35,0.14,0.02,0.47,0.32,-0.51,-0.10,0.20,-0.79,0.03,-0.15,0.09,0.34,0.58
281
+ QIAN880123,-0.44,-0.13,0.05,-0.20,0.13,-0.58,-0.28,0.08,0.09,-0.04,-0.12,-0.33,-0.21,-0.13,-0.48,0.27,0.47,-0.22,-0.11,0.06
282
+ QIAN880124,-0.31,-0.10,0.06,0.13,-0.11,-0.47,-0.05,0.45,-0.06,-0.25,-0.44,-0.44,-0.28,-0.04,-0.29,0.34,0.27,-0.08,0.06,0.11
283
+ QIAN880125,-0.02,0.04,0.03,0.11,-0.02,-0.17,0.10,0.38,-0.09,-0.48,-0.26,-0.39,-0.14,-0.03,-0.04,0.41,0.36,-0.01,-0.08,-0.18
284
+ QIAN880126,-0.06,0.02,0.10,0.24,-0.19,-0.04,-0.04,0.17,0.19,-0.20,-0.46,-0.43,-0.52,-0.33,0.37,0.43,0.50,-0.32,0.35,0.00
285
+ QIAN880127,-0.05,0.06,0.00,0.15,0.30,-0.08,-0.02,-0.14,-0.07,0.26,0.04,-0.42,0.25,0.09,0.31,-0.11,-0.06,0.19,0.33,0.04
286
+ QIAN880128,-0.19,0.17,-0.38,0.09,0.41,0.04,-0.20,0.28,-0.19,-0.06,0.34,-0.20,0.45,0.07,0.04,-0.23,-0.02,0.16,0.22,0.05
287
+ QIAN880129,-0.43,0.06,0.00,-0.31,0.19,0.14,-0.41,-0.21,0.21,0.29,-0.10,0.33,-0.01,0.25,0.28,-0.23,-0.26,0.15,0.09,-0.10
288
+ QIAN880130,-0.19,-0.07,0.17,-0.27,0.42,-0.29,-0.22,0.17,0.17,-0.34,-0.22,0.00,-0.53,-0.31,0.14,0.22,0.10,-0.15,-0.02,-0.33
289
+ QIAN880131,-0.25,0.12,0.61,0.60,0.18,0.09,-0.12,0.09,0.42,-0.54,-0.55,0.14,-0.47,-0.29,0.89,0.24,0.16,-0.44,-0.19,-0.45
290
+ QIAN880132,-0.27,-0.40,0.71,0.54,0.00,-0.08,-0.12,1.14,0.18,-0.74,-0.54,0.45,-0.76,-0.47,1.40,0.40,-0.10,-0.46,-0.05,-0.86
291
+ QIAN880133,-0.42,-0.23,0.81,0.95,-0.18,-0.01,-0.09,1.24,0.05,-1.17,-0.69,0.09,-0.86,-0.39,1.77,0.63,0.29,-0.37,-0.41,-1.32
292
+ QIAN880134,-0.24,-0.04,0.45,0.65,-0.38,0.01,0.07,0.85,-0.21,-0.65,-0.80,0.17,-0.71,-0.61,2.27,0.33,0.13,-0.44,-0.49,-0.99
293
+ QIAN880135,-0.14,0.21,0.35,0.66,-0.09,0.11,0.06,0.36,-0.31,-0.51,-0.80,-0.14,-0.56,-0.25,1.59,0.32,0.21,-0.17,-0.35,-0.70
294
+ QIAN880136,0.01,-0.13,-0.11,0.78,-0.31,-0.13,0.09,0.14,-0.56,-0.09,-0.81,-0.43,-0.49,-0.20,1.14,0.13,-0.02,-0.20,0.10,-0.11
295
+ QIAN880137,-0.30,-0.09,-0.12,0.44,0.03,0.24,0.18,-0.12,-0.20,-0.07,-0.18,0.06,-0.44,0.11,0.77,-0.09,-0.27,-0.09,-0.25,-0.06
296
+ QIAN880138,-0.23,-0.20,0.06,0.34,0.19,0.47,0.28,0.14,-0.22,0.42,-0.36,-0.15,-0.19,-0.02,0.78,-0.29,-0.30,-0.18,0.07,0.29
297
+ QIAN880139,0.08,-0.01,-0.06,0.04,0.37,0.48,0.36,-0.02,-0.45,0.09,0.24,-0.27,0.16,0.34,0.16,-0.35,-0.04,-0.06,-0.20,0.18
298
+ RACS770101,0.934,0.962,0.986,0.994,0.900,1.047,0.986,1.015,0.882,0.766,0.825,1.040,0.804,0.773,1.047,1.056,1.008,0.848,0.931,0.825
299
+ RACS770102,0.941,1.112,1.038,1.071,0.866,1.150,1.100,1.055,0.911,0.742,0.798,1.232,0.781,0.723,1.093,1.082,1.043,0.867,1.050,0.817
300
+ RACS770103,1.16,1.72,1.97,2.66,0.50,3.87,2.40,1.63,0.86,0.57,0.51,3.90,0.40,0.43,2.04,1.61,1.48,0.75,1.72,0.59
301
+ RACS820101,0.85,2.02,0.88,1.50,0.90,1.71,1.79,1.54,1.59,0.67,1.03,0.88,1.17,0.85,1.47,1.50,1.96,0.83,1.34,0.89
302
+ RACS820102,1.58,1.14,0.77,0.98,1.04,1.24,1.49,0.66,0.99,1.09,1.21,1.27,1.41,1.00,1.46,1.05,0.87,1.23,0.68,0.88
303
+ RACS820103,0.82,2.60,2.07,2.64,0.00,0.00,2.62,1.63,0.00,2.32,0.00,2.86,0.00,0.00,0.00,1.23,2.48,0.00,1.90,1.62
304
+ RACS820104,0.78,1.75,1.32,1.25,3.14,0.93,0.94,1.13,1.03,1.26,0.91,0.85,0.41,1.07,1.73,1.31,1.57,0.98,1.31,1.11
305
+ RACS820105,0.88,0.99,1.02,1.16,1.14,0.93,1.01,0.70,1.87,1.61,1.09,0.83,1.71,1.52,0.87,1.14,0.96,1.96,1.68,1.56
306
+ RACS820106,0.30,0.90,2.73,1.26,0.72,0.97,1.33,3.09,1.33,0.45,0.96,0.71,1.89,1.20,0.83,1.16,0.97,1.58,0.86,0.64
307
+ RACS820107,0.40,1.20,1.24,1.59,2.98,0.50,1.26,1.89,2.71,1.31,0.57,0.87,0.00,1.27,0.38,0.92,1.38,1.53,1.79,0.95
308
+ RACS820108,1.48,1.02,0.99,1.19,0.86,1.42,1.43,0.46,1.27,1.12,1.33,1.36,1.41,1.30,0.25,0.89,0.81,1.27,0.91,0.93
309
+ RACS820109,0.00,0.00,4.14,2.15,0.00,0.00,0.00,6.49,0.00,0.00,0.00,0.00,0.00,2.11,1.99,0.00,1.24,0.00,1.90,0.00
310
+ RACS820110,1.02,1.00,1.31,1.76,1.05,1.05,0.83,2.39,0.40,0.83,1.06,0.94,1.33,0.41,2.73,1.18,0.77,1.22,1.09,0.88
311
+ RACS820111,0.93,1.52,0.92,0.60,1.08,0.94,0.73,0.78,1.08,1.74,1.03,1.00,1.31,1.51,1.37,0.97,1.38,1.12,1.65,1.70
312
+ RACS820112,0.99,1.19,1.15,1.18,2.32,1.52,1.36,1.40,1.06,0.81,1.26,0.91,1.00,1.25,0.00,1.50,1.18,1.33,1.09,1.01
313
+ RACS820113,17.05,21.25,34.81,19.27,28.84,15.42,20.12,38.14,23.07,16.66,10.89,16.46,20.61,16.26,23.94,19.95,18.92,23.36,26.49,17.06
314
+ RACS820114,14.53,17.82,13.59,19.78,30.57,22.18,18.19,37.16,22.63,20.28,14.30,14.07,20.61,19.61,52.63,18.56,21.09,19.78,26.36,21.87
315
+ RADA880101,1.81,-14.92,-6.64,-8.72,1.28,-5.54,-6.81,0.94,-4.66,4.92,4.92,-5.55,2.35,2.98,0.,-3.40,-2.57,2.33,-0.14,4.04
316
+ RADA880102,0.52,-1.32,-0.01,0.,0.,-0.07,-0.79,0.,0.95,2.04,1.76,0.08,1.32,2.09,0.,0.04,0.27,2.51,1.63,1.18
317
+ RADA880103,0.13,-5.,-3.04,-2.23,-2.52,-3.84,-3.43,1.45,-5.61,-2.77,-2.64,-3.97,-3.83,-3.74,0.,-1.66,-2.31,-8.21,-5.97,-2.05
318
+ RADA880104,1.29,-13.60,-6.63,0.,0.,-5.47,-6.02,0.94,-5.61,2.88,3.16,-5.63,1.03,0.89,0.,-3.44,-2.84,-0.18,-1.77,2.86
319
+ RADA880105,1.42,-18.60,-9.67,0.,0.,-9.31,-9.45,2.39,-11.22,0.11,0.52,-9.60,-2.80,-2.85,0.,-5.10,-5.15,-8.39,-7.74,0.81
320
+ RADA880106,93.7,250.4,146.3,142.6,135.2,177.7,182.9,52.6,188.1,182.2,173.7,215.2,197.6,228.6,0.,109.5,142.1,271.6,239.9,157.2
321
+ RADA880107,-0.29,-2.71,-1.18,-1.02,0.,-1.53,-0.90,-0.34,-0.94,0.24,-0.12,-2.05,-0.24,0.,0.,-0.75,-0.71,-0.59,-1.02,0.09
322
+ RADA880108,-0.06,-0.84,-0.48,-0.80,1.36,-0.73,-0.77,-0.41,0.49,1.31,1.21,-1.18,1.27,1.27,0.,-0.50,-0.27,0.88,0.33,1.09
323
+ RICJ880101,0.7,0.4,1.2,1.4,0.6,1.,1.,1.6,1.2,0.9,0.9,1.,0.3,1.2,0.7,1.6,0.3,1.1,1.9,0.7
324
+ RICJ880102,0.7,0.4,1.2,1.4,0.6,1.,1.,1.6,1.2,0.9,0.9,1.,0.3,1.2,0.7,1.6,0.3,1.1,1.9,0.7
325
+ RICJ880103,0.5,0.4,3.5,2.1,0.6,0.4,0.4,1.8,1.1,0.2,0.2,0.7,0.8,0.2,0.8,2.3,1.6,0.3,0.8,0.1
326
+ RICJ880104,1.2,0.7,0.7,0.8,0.8,0.7,2.2,0.3,0.7,0.9,0.9,0.6,0.3,0.5,2.6,0.7,0.8,2.1,1.8,1.1
327
+ RICJ880105,1.6,0.9,0.7,2.6,1.2,0.8,2.,0.9,0.7,0.7,0.3,1.,1.,0.9,0.5,0.8,0.7,1.7,0.4,0.6
328
+ RICJ880106,1.,0.4,0.7,2.2,0.6,1.5,3.3,0.6,0.7,0.4,0.6,0.8,1.,0.6,0.4,0.4,1.,1.4,1.2,1.1
329
+ RICJ880107,1.1,1.5,0.,0.3,1.1,1.3,0.5,0.4,1.5,1.1,2.6,0.8,1.7,1.9,0.1,0.4,0.5,3.1,0.6,1.5
330
+ RICJ880108,1.4,1.2,1.2,0.6,1.6,1.4,0.9,0.6,0.9,0.9,1.1,1.9,1.7,1.,0.3,1.1,0.6,1.4,0.2,0.8
331
+ RICJ880109,1.8,1.3,0.9,1.,0.7,1.3,0.8,0.5,1.,1.2,1.2,1.1,1.5,1.3,0.3,0.6,1.,1.5,0.8,1.2
332
+ RICJ880110,1.8,1.,0.6,0.7,0.,1.,1.1,0.5,2.4,1.3,1.2,1.4,2.7,1.9,0.3,0.5,0.5,1.1,1.3,0.4
333
+ RICJ880111,1.3,0.8,0.6,0.5,0.7,0.2,0.7,0.5,1.9,1.6,1.4,1.,2.8,2.9,0.,0.5,0.6,2.1,0.8,1.4
334
+ RICJ880112,0.7,0.8,0.8,0.6,0.2,1.3,1.6,0.1,1.1,1.4,1.9,2.2,1.,1.8,0.,0.6,0.7,0.4,1.1,1.3
335
+ RICJ880113,1.4,2.1,0.9,0.7,1.2,1.6,1.7,0.2,1.8,0.4,0.8,1.9,1.3,0.3,0.2,1.6,0.9,0.4,0.3,0.7
336
+ RICJ880114,1.1,1.,1.2,0.4,1.6,2.1,0.8,0.2,3.4,0.7,0.7,2.,1.,0.7,0.,1.7,1.,0.,1.2,0.7
337
+ RICJ880115,0.8,0.9,1.6,0.7,0.4,0.9,0.3,3.9,1.3,0.7,0.7,1.3,0.8,0.5,0.7,0.8,0.3,0.,0.8,0.2
338
+ RICJ880116,1.,1.4,0.9,1.4,0.8,1.4,0.8,1.2,1.2,1.1,0.9,1.2,0.8,0.1,1.9,0.7,0.8,0.4,0.9,0.6
339
+ RICJ880117,0.7,1.1,1.5,1.4,0.4,1.1,0.7,0.6,1.,0.7,0.5,1.3,0.,1.2,1.5,0.9,2.1,2.7,0.5,1.
340
+ ROBB760101,6.5,-0.9,-5.1,0.5,-1.3,1.0,7.8,-8.6,1.2,0.6,3.2,2.3,5.3,1.6,-7.7,-3.9,-2.6,1.2,-4.5,1.4
341
+ ROBB760102,2.3,-5.2,0.3,7.4,0.8,-0.7,10.3,-5.2,-2.8,-4.0,-2.1,-4.1,-3.5,-1.1,8.1,-3.5,2.3,-0.9,-3.7,-4.4
342
+ ROBB760103,6.7,0.3,-6.1,-3.1,-4.9,0.6,2.2,-6.8,-1.0,3.2,5.5,0.5,7.2,2.8,-22.8,-3.0,-4.0,4.0,-4.6,2.5
343
+ ROBB760104,2.3,1.4,-3.3,-4.4,6.1,2.7,2.5,-8.3,5.9,-0.5,0.1,7.3,3.5,1.6,-24.4,-1.9,-3.7,-0.9,-0.6,2.3
344
+ ROBB760105,-2.3,0.4,-4.1,-4.4,4.4,1.2,-5.0,-4.2,-2.5,6.7,2.3,-3.3,2.3,2.6,-1.8,-1.7,1.3,-1.0,4.0,6.8
345
+ ROBB760106,-2.7,0.4,-4.2,-4.4,3.7,0.8,-8.1,-3.9,-3.0,7.7,3.7,-2.9,3.7,3.0,-6.6,-2.4,1.7,0.3,3.3,7.1
346
+ ROBB760107,0.0,1.1,-2.0,-2.6,5.4,2.4,3.1,-3.4,0.8,-0.1,-3.7,-3.1,-2.1,0.7,7.4,1.3,0.0,-3.4,4.8,2.7
347
+ ROBB760108,-5.0,2.1,4.2,3.1,4.4,0.4,-4.7,5.7,-0.3,-4.6,-5.6,1.0,-4.8,-1.8,2.6,2.6,0.3,3.4,2.9,-6.0
348
+ ROBB760109,-3.3,0.0,5.4,3.9,-0.3,-0.4,-1.8,-1.2,3.0,-0.5,-2.3,-1.2,-4.3,0.8,6.5,1.8,-0.7,-0.8,3.1,-3.5
349
+ ROBB760110,-4.7,2.0,3.9,1.9,6.2,-2.0,-4.2,5.7,-2.6,-7.0,-6.2,2.8,-4.8,-3.7,3.6,2.1,0.6,3.3,3.8,-6.2
350
+ ROBB760111,-3.7,1.0,-0.6,-0.6,4.0,3.4,-4.3,5.9,-0.8,-0.5,-2.8,1.3,-1.6,1.6,-6.0,1.5,1.2,6.5,1.3,-4.6
351
+ ROBB760112,-2.5,-1.2,4.6,0.0,-4.7,-0.5,-4.4,4.9,1.6,-3.3,-2.0,-0.8,-4.1,-4.1,5.8,2.5,1.7,1.2,-0.6,-3.5
352
+ ROBB760113,-5.1,2.6,4.7,3.1,3.8,0.2,-5.2,5.6,-0.9,-4.5,-5.4,1.0,-5.3,-2.4,3.5,3.2,0.0,2.9,3.2,-6.3
353
+ ROBB790101,-1.0,0.3,-0.7,-1.2,2.1,-0.1,-0.7,0.3,1.1,4.0,2.0,-0.9,1.8,2.8,0.4,-1.2,-0.5,3.0,2.1,1.4
354
+ ROSG850101,86.6,162.2,103.3,97.8,132.3,119.2,113.9,62.9,155.8,158.0,164.1,115.5,172.9,194.1,92.9,85.6,106.5,224.6,177.7,141.0
355
+ ROSG850102,0.74,0.64,0.63,0.62,0.91,0.62,0.62,0.72,0.78,0.88,0.85,0.52,0.85,0.88,0.64,0.66,0.70,0.85,0.76,0.86
356
+ ROSM880101,-0.67,12.1,7.23,8.72,-0.34,6.39,7.35,0.00,3.82,-3.02,-3.02,6.13,-1.30,-3.24,-1.75,4.35,3.86,-2.86,0.98,-2.18
357
+ ROSM880102,-0.67,3.89,2.27,1.57,-2.00,2.12,1.78,0.00,1.09,-3.02,-3.02,2.46,-1.67,-3.24,-1.75,0.10,-0.42,-2.86,0.98,-2.18
358
+ ROSM880103,0.4,0.3,0.9,0.8,0.5,0.7,1.3,0.0,1.0,0.4,0.6,0.4,0.3,0.7,0.9,0.4,0.4,0.6,1.2,0.4
359
+ SIMZ760101,0.73,0.73,-0.01,0.54,0.70,-0.10,0.55,0.00,1.10,2.97,2.49,1.50,1.30,2.65,2.60,0.04,0.44,3.00,2.97,1.69
360
+ SNEP660101,0.239,0.211,0.249,0.171,0.220,0.260,0.187,0.160,0.205,0.273,0.281,0.228,0.253,0.234,0.165,0.236,0.213,0.183,0.193,0.255
361
+ SNEP660102,0.330,-0.176,-0.233,-0.371,0.074,-0.254,-0.409,0.370,-0.078,0.149,0.129,-0.075,-0.092,-0.011,0.370,0.022,0.136,-0.011,-0.138,0.245
362
+ SNEP660103,-0.110,0.079,-0.136,-0.285,-0.184,-0.067,-0.246,-0.073,0.320,0.001,-0.008,0.049,-0.041,0.438,-0.016,-0.153,-0.208,0.493,0.381,-0.155
363
+ SNEP660104,-0.062,-0.167,0.166,-0.079,0.380,-0.025,-0.184,-0.017,0.056,-0.309,-0.264,-0.371,0.077,0.074,-0.036,0.470,0.348,0.050,0.220,-0.212
364
+ SUEM840101,1.071,1.033,0.784,0.680,0.922,0.977,0.970,0.591,0.850,1.140,1.140,0.939,1.200,1.086,0.659,0.760,0.817,1.107,1.020,0.950
365
+ SUEM840102,8.0,0.1,0.1,70.0,26.0,33.0,6.0,0.1,0.1,55.0,33.0,1.0,54.0,18.0,42.0,0.1,0.1,77.0,66.0,0.1
366
+ SWER830101,-0.40,-0.59,-0.92,-1.31,0.17,-0.91,-1.22,-0.67,-0.64,1.25,1.22,-0.67,1.02,1.92,-0.49,-0.55,-0.28,0.50,1.67,0.91
367
+ TANS770101,1.42,1.06,0.71,1.01,0.73,1.02,1.63,0.50,1.20,1.12,1.29,1.24,1.21,1.16,0.65,0.71,0.78,1.05,0.67,0.99
368
+ TANS770102,0.946,1.128,0.432,1.311,0.481,1.615,0.698,0.360,2.168,1.283,1.192,1.203,0.000,0.963,2.093,0.523,1.961,1.925,0.802,0.409
369
+ TANS770103,0.790,1.087,0.832,0.530,1.268,1.038,0.643,0.725,0.864,1.361,1.111,0.735,1.092,1.052,1.249,1.093,1.214,1.114,1.340,1.428
370
+ TANS770104,1.194,0.795,0.659,1.056,0.678,1.290,0.928,1.015,0.611,0.603,0.595,1.060,0.831,0.377,3.159,1.444,1.172,0.452,0.816,0.640
371
+ TANS770105,0.497,0.677,2.072,1.498,1.348,0.711,0.651,1.848,1.474,0.471,0.656,0.932,0.425,1.348,0.179,1.151,0.749,1.283,1.283,0.654
372
+ TANS770106,0.937,1.725,1.080,1.640,1.004,1.078,0.679,0.901,1.085,0.178,0.808,1.254,0.886,0.803,0.748,1.145,1.487,0.803,1.227,0.625
373
+ TANS770107,0.289,1.380,3.169,0.917,1.767,2.372,0.285,4.259,1.061,0.262,0.000,1.288,0.000,0.393,0.000,0.160,0.218,0.000,0.654,0.167
374
+ TANS770108,0.328,2.088,1.498,3.379,0.000,0.000,0.000,0.500,1.204,2.078,0.414,0.835,0.982,1.336,0.415,1.089,1.732,1.781,0.000,0.946
375
+ TANS770109,0.945,0.364,1.202,1.315,0.932,0.704,1.014,2.355,0.525,0.673,0.758,0.947,1.028,0.622,0.579,1.140,0.863,0.777,0.907,0.561
376
+ TANS770110,0.842,0.936,1.352,1.366,1.032,0.998,0.758,1.349,1.079,0.459,0.665,1.045,0.668,0.881,1.385,1.257,1.055,0.881,1.101,0.643
377
+ VASM830101,0.135,0.296,0.196,0.289,0.159,0.236,0.184,0.051,0.223,0.173,0.215,0.170,0.239,0.087,0.151,0.010,0.100,0.166,0.066,0.285
378
+ VASM830102,0.507,0.459,0.287,0.223,0.592,0.383,0.445,0.390,0.310,0.111,0.619,0.559,0.431,0.077,0.739,0.689,0.785,0.160,0.060,0.356
379
+ VASM830103,0.159,0.194,0.385,0.283,0.187,0.236,0.206,0.049,0.233,0.581,0.083,0.159,0.198,0.682,0.366,0.150,0.074,0.463,0.737,0.301
380
+ VELV850101,.03731,.09593,.00359,.12630,.08292,.07606,.00580,.00499,.02415,.00000,.00000,.03710,.08226,.09460,.01979,.08292,.09408,.05481,.05159,.00569
381
+ VENT840101,0.,0.,0.,0.,0.,0.,0.,0.,0.,1.,1.,0.,0.,1.,0.,0.,0.,1.,1.,1.
382
+ VHEG790101,-12.04,39.23,4.25,23.22,3.95,2.16,16.81,-7.85,6.28,-18.32,-17.79,9.71,-8.86,-21.98,5.82,-1.54,-4.15,-16.19,-1.51,-16.22
383
+ WARP780101,10.04,6.18,5.63,5.76,8.89,5.41,5.37,7.99,7.49,8.72,8.79,4.40,9.15,7.98,7.79,7.08,7.00,8.07,6.90,8.88
384
+ WEBA780101,0.89,0.88,0.89,0.87,0.85,0.82,0.84,0.92,0.83,0.76,0.73,0.97,0.74,0.52,0.82,0.96,0.92,0.20,0.49,0.85
385
+ WERD780101,0.52,0.49,0.42,0.37,0.83,0.35,0.38,0.41,0.70,0.79,0.77,0.31,0.76,0.87,0.35,0.49,0.38,0.86,0.64,0.72
386
+ WERD780102,0.16,-0.20,1.03,-0.24,-0.12,-0.55,-0.45,-0.16,-0.18,-0.19,-0.44,-0.12,-0.79,-0.25,-0.59,-0.01,0.05,-0.33,-0.42,-0.46
387
+ WERD780103,0.15,-0.37,0.69,-0.22,-0.19,-0.06,0.14,0.36,-0.25,0.02,0.06,-0.16,0.11,1.18,0.11,0.13,0.28,-0.12,0.19,-0.08
388
+ WERD780104,-0.07,-0.40,-0.57,-0.80,0.17,-0.26,-0.63,0.27,-0.49,0.06,-0.17,-0.45,0.03,0.40,-0.47,-0.11,0.09,-0.61,-0.61,-0.11
389
+ WOEC730101,7.0,9.1,10.0,13.0,5.5,8.6,12.5,7.9,8.4,4.9,4.9,10.1,5.3,5.0,6.6,7.5,6.6,5.3,5.7,5.6
390
+ WOLR810101,1.94,-19.92,-9.68,-10.95,-1.24,-9.38,-10.20,2.39,-10.27,2.15,2.28,-9.52,-1.48,-0.76,-3.68,-5.06,-4.88,-5.88,-6.11,1.99
391
+ WOLS870101,0.07,2.88,3.22,3.64,0.71,2.18,3.08,2.23,2.41,-4.44,-4.19,2.84,-2.49,-4.92,-1.22,1.96,0.92,-4.75,-1.39,-2.69
392
+ WOLS870102,-1.73,2.52,1.45,1.13,-0.97,0.53,0.39,-5.36,1.74,-1.68,-1.03,1.41,-0.27,1.30,0.88,-1.63,-2.09,3.65,2.32,-2.53
393
+ WOLS870103,0.09,-3.44,0.84,2.36,4.13,-1.14,-0.07,0.30,1.11,-1.03,-0.98,-3.14,-0.41,0.45,2.23,0.57,-1.40,0.85,0.01,-1.29
394
+ YUTK870101,8.5,0.,8.2,8.5,11.0,6.3,8.8,7.1,10.1,16.8,15.0,7.9,13.3,11.2,8.2,7.4,8.8,9.9,8.8,12.0
395
+ YUTK870102,6.8,0.,6.2,7.0,8.3,8.5,4.9,6.4,9.2,10.0,12.2,7.5,8.4,8.3,6.9,8.0,7.0,5.7,6.8,9.4
396
+ YUTK870103,18.08,0.,17.47,17.36,18.17,17.93,18.16,18.24,18.49,18.62,18.60,17.96,18.11,17.30,18.16,17.57,17.54,17.19,17.99,18.30
397
+ YUTK870104,18.56,0.,18.24,17.94,17.84,18.51,17.97,18.57,18.64,19.21,19.01,18.36,18.49,17.95,18.77,18.06,17.71,16.87,18.23,18.98
398
+ ZASB820101,-0.152,-0.089,-0.203,-0.355,0.,-0.181,-0.411,-0.190,0.,-0.086,-0.102,-0.062,-0.107,0.001,-0.181,-0.203,-0.170,0.275,0.,-0.125
399
+ ZIMJ680101,0.83,0.83,0.09,0.64,1.48,0.00,0.65,0.10,1.10,3.07,2.52,1.60,1.40,2.75,2.70,0.14,0.54,0.31,2.97,1.79
400
+ ZIMJ680102,11.50,14.28,12.82,11.68,13.46,14.45,13.57,3.40,13.69,21.40,21.40,15.71,16.25,19.80,17.43,9.47,15.77,21.67,18.03,21.57
401
+ ZIMJ680103,0.00,52.00,3.38,49.70,1.48,3.53,49.90,0.00,51.60,0.13,0.13,49.50,1.43,0.35,1.58,1.67,1.66,2.10,1.61,0.13
402
+ ZIMJ680104,6.00,10.76,5.41,2.77,5.05,5.65,3.22,5.97,7.59,6.02,5.98,9.74,5.74,5.48,6.30,5.68,5.66,5.89,5.66,5.96
403
+ ZIMJ680105,9.9,4.6,5.4,2.8,2.8,9.0,3.2,5.6,8.2,17.1,17.6,3.5,14.9,18.8,14.8,6.9,9.5,17.1,15.0,14.3
404
+ AURR980101,0.94,1.15,0.79,1.19,0.60,0.94,1.41,1.18,1.15,1.07,0.95,1.03,0.88,1.06,1.18,0.69,0.87,0.91,1.04,0.90
405
+ AURR980102,0.98,1.14,1.05,1.05,0.41,0.90,1.04,1.25,1.01,0.88,0.80,1.06,1.12,1.12,1.31,1.02,0.80,0.90,1.12,0.87
406
+ AURR980103,1.05,0.81,0.91,1.39,0.60,0.87,1.11,1.26,1.43,0.95,0.96,0.97,0.99,0.95,1.05,0.96,1.03,1.06,0.94,0.62
407
+ AURR980104,0.75,0.90,1.24,1.72,0.66,1.08,1.10,1.14,0.96,0.80,1.01,0.66,1.02,0.88,1.33,1.20,1.13,0.68,0.80,0.58
408
+ AURR980105,0.67,0.76,1.28,1.58,0.37,1.05,0.94,0.98,0.83,0.78,0.79,0.84,0.98,0.96,1.12,1.25,1.41,0.94,0.82,0.67
409
+ AURR980106,1.10,1.05,0.72,1.14,0.26,1.31,2.30,0.55,0.83,1.06,0.84,1.08,0.90,0.90,1.67,0.81,0.77,1.26,0.99,0.76
410
+ AURR980107,1.39,0.95,0.67,1.64,0.52,1.60,2.07,0.65,1.36,0.64,0.91,0.80,1.10,1.00,0.94,0.69,0.92,1.10,0.73,0.70
411
+ AURR980108,1.43,1.33,0.55,0.90,0.52,1.43,1.70,0.56,0.66,1.18,1.52,0.82,1.68,1.10,0.15,0.61,0.75,1.68,0.65,1.14
412
+ AURR980109,1.55,1.39,0.60,0.61,0.59,1.43,1.34,0.37,0.89,1.47,1.36,1.27,2.13,1.39,0.03,0.44,0.65,1.10,0.93,1.18
413
+ AURR980110,1.80,1.73,0.73,0.90,0.55,0.97,1.73,0.32,0.46,1.09,1.47,1.24,1.64,0.96,0.15,0.67,0.70,0.68,0.91,0.81
414
+ AURR980111,1.52,1.49,0.58,1.04,0.26,1.41,1.76,0.30,0.83,1.25,1.26,1.10,1.14,1.14,0.44,0.66,0.73,0.68,1.04,1.03
415
+ AURR980112,1.49,1.41,0.67,0.94,0.37,1.52,1.55,0.29,0.96,1.04,1.40,1.17,1.84,0.86,0.20,0.68,0.79,1.52,1.06,0.94
416
+ AURR980113,1.73,1.24,0.70,0.68,0.63,0.88,1.16,0.32,0.76,1.15,1.80,1.22,2.21,1.35,0.07,0.65,0.46,1.57,1.10,0.94
417
+ AURR980114,1.33,1.39,0.64,0.60,0.44,1.37,1.43,0.20,1.02,1.58,1.63,1.71,1.76,1.22,0.07,0.42,0.57,1.00,1.02,1.08
418
+ AURR980115,1.87,1.66,0.70,0.91,0.33,1.24,1.88,0.33,0.89,0.90,1.65,1.63,1.35,0.67,0.03,0.71,0.50,1.00,0.73,0.51
419
+ AURR980116,1.19,1.45,1.33,0.72,0.44,1.43,1.27,0.74,1.55,0.61,1.36,1.45,1.35,1.20,0.10,1.02,0.82,0.58,1.06,0.46
420
+ AURR980117,0.77,1.11,1.39,0.79,0.44,0.95,0.92,2.74,1.65,0.64,0.66,1.19,0.74,1.04,0.66,0.64,0.82,0.58,0.93,0.53
421
+ AURR980118,0.93,0.96,0.82,1.15,0.67,1.02,1.07,1.08,1.40,1.14,1.16,1.27,1.11,1.05,1.01,0.71,0.84,1.06,1.15,0.74
422
+ AURR980119,1.09,1.29,1.03,1.17,0.26,1.08,1.31,0.97,0.88,0.97,0.87,1.13,0.96,0.84,2.01,0.76,0.79,0.91,0.64,0.77
423
+ AURR980120,0.71,1.09,0.95,1.43,0.65,0.87,1.19,1.07,1.13,1.05,0.84,1.10,0.80,0.95,1.70,0.65,.086,1.25,0.85,1.12
424
+ ONEK900101,13.4,13.3,12.0,11.7,11.6,12.8,12.2,11.3,11.6,12.0,13.0,13.0,12.8,12.1,6.5,12.2,11.7,12.4,12.1,11.9
425
+ ONEK900102,-0.77,-0.68,-0.07,-0.15,-0.23,-0.33,-0.27,0.00,-0.06,-0.23,-0.62,-0.65,-0.50,-0.41,3,-0.35,-0.11,-0.45,-0.17,-0.14
426
+ VINM940101,0.984,1.008,1.048,1.068,0.906,1.037,1.094,1.031,0.950,0.927,0.935,1.102,0.952,0.915,1.049,1.046,0.997,0.904,0.929,0.931
427
+ VINM940102,1.315,1.310,1.380,1.372,1.196,1.342,1.376,1.382,1.279,1.241,1.234,1.367,1.269,1.247,1.342,1.381,1.324,1.186,1.199,1.235
428
+ VINM940103,0.994,1.026,1.022,1.022,0.939,1.041,1.052,1.018,0.967,0.977,0.982,1.029,0.963,0.934,1.050,1.025,0.998,0.938,0.981,0.968
429
+ VINM940104,0.783,0.807,0.799,0.822,0.785,0.817,0.826,0.784,0.777,0.776,0.783,0.834,0.806,0.774,0.809,0.811,0.795,0.796,0.788,0.781
430
+ MUNV940101,0.423,0.503,0.906,0.870,0.877,0.594,0.167,1.162,0.802,0.566,0.494,0.615,0.444,0.706,1.945,0.928,0.884,0.690,0.778,0.706
431
+ MUNV940102,0.619,0.753,1.089,0.932,1.107,0.770,0.675,1.361,1.034,0.876,0.740,0.784,0.736,0.968,1.780,0.969,1.053,0.910,1.009,0.939
432
+ MUNV940103,1.080,0.976,1.197,1.266,0.733,1.050,1.085,1.104,0.906,0.583,0.789,1.026,0.812,0.685,1.412,0.987,0.784,0.755,0.665,0.546
433
+ MUNV940104,0.978,0.784,0.915,1.038,0.573,0.863,0.962,1.405,0.724,0.502,0.766,0.841,0.729,0.585,2.613,0.784,0.569,0.671,0.560,0.444
434
+ MUNV940105,1.40,1.23,1.61,1.89,1.14,1.33,1.42,2.06,1.25,1.02,1.33,1.34,1.12,1.07,3.90,1.20,0.99,1.10,0.98,0.87
435
+ WIMW960101,4.08,3.91,3.83,3.02,4.49,3.67,2.23,4.24,4.08,4.52,4.81,3.77,4.48,5.38,3.80,4.12,4.11,6.10,5.19,4.18
436
+ KIMC930101,-0.35,-0.44,-0.38,-0.41,-0.47,-0.40,-0.41,0.0,-0.46,-0.56,-0.48,-0.41,-0.46,-0.55,-0.23,-0.39,-0.48,-0.48,-0.50,-0.53
437
+ MONM990101,0.5,1.7,1.7,1.6,0.6,1.6,1.6,1.3,1.6,0.6,0.4,1.6,0.5,0.4,1.7,0.7,0.4,0.7,0.6,0.5
438
+ BLAM930101,0.96,0.77,0.39,0.42,0.42,0.80,0.53,0.00,0.57,0.84,0.92,0.73,0.86,0.59,-2.50,0.53,0.54,0.58,0.72,0.63
439
+ PARS000101,0.343,0.353,0.409,0.429,0.319,0.395,0.405,0.389,0.307,0.296,0.287,0.429,0.293,0.292,0.432,0.416,0.362,0.268,0.22,0.307
440
+ PARS000102,0.320,0.327,0.384,0.424,0.198,0.436,0.514,0.374,0.299,0.306,0.340,0.446,0.313,0.314,0.354,0.376,0.339,0.291,0.287,0.294
441
+ KUMS000101,8.9,4.6,4.4,6.3,0.6,2.8,6.9,9.4,2.2,7.0,7.4,6.1,2.3,3.3,4.2,4.0,5.7,1.3,4.5,8.2
442
+ KUMS000102,9.2,3.6,5.1,6.0,1.0,2.9,6.0,9.4,2.1,6.0,7.7,6.5,2.4,3.4,4.2,5.5,5.7,1.2,3.7,8.2
443
+ KUMS000103,14.1,5.5,3.2,5.7,0.1,3.7,8.8,4.1,2.0,7.1,9.1,7.7,3.3,5.0,0.7,3.9,4.4,1.2,4.5,5.9
444
+ KUMS000104,13.4,3.9,3.7,4.6,0.8,4.8,7.8,4.6,3.3,6.5,10.6,7.5,3.0,4.5,1.3,3.8,4.6,1.0,3.3,7.1
445
+ TAKK010101,9.8,7.3,3.6,4.9,3.0,2.4,4.4,0,11.9,17.2,17.0,10.5,11.9,23.0,15.0,2.6,6.9,24.2,17.2,15.3
446
+ FODM020101,0.70,0.95,1.47,0.87,1.17,0.73,0.96,0.64,1.39,1.29,1.44,0.91,0.91,1.34,0.12,0.84,0.74,1.80,1.68,1.20
447
+ NADH010101,58,-184,-93,-97,116,-139,-131,-11,-73,107,95,-24,78,92,-79,-34,-7,59,-11,100
448
+ NADH010102,51,-144,-84,-78,137,-128,-115,-13,-55,106,103,-205,73,108,-79,-26,-3,69,11,108
449
+ NADH010103,41,-109,-74,-47,169,-104,-90,-18,-35,104,103,-148,77,128,-81,-31,10,102,36,116
450
+ NADH010104,32,-95,-73,-29,182,-95,-74,-22,-25,106,104,-124,82,132,-82,-34,20,118,44,113
451
+ NADH010105,24,-79,-76,0,194,-87,-57,-28,-31,102,103,-9,90,131,-85,-36,34,116,43,111
452
+ NADH010106,5,-57,-77,45,224,-67,-8,-47,-50,83,82,-38,83,117,-103,-41,79,130,27,117
453
+ NADH010107,-2,-41,-97,248,329,-37,117,-66,-70,28,36,115,62,120,-132,-52,174,179,-7,114
454
+ MONM990201,0.4,1.5,1.6,1.5,0.7,1.4,1.3,1.1,1.4,0.5,0.3,1.4,0.5,0.3,1.6,0.9,0.7,0.9,0.9,0.4
455
+ KOEP990101,-0.04,-0.30,0.25,0.27,0.57,-0.02,-0.33,1.24,-0.11,-0.26,-0.38,-0.18,-0.09,-0.01,0.,0.15,0.39,0.21,0.05,-0.06
456
+ KOEP990102,-0.12,0.34,1.05,1.12,-0.63,1.67,0.91,0.76,1.34,-0.77,0.15,0.29,-0.71,-0.67,0.,1.45,-0.70,-0.14,-0.49,-0.70
457
+ CEDJ970101,8.6,4.2,4.6,4.9,2.9,4.0,5.1,7.8,2.1,4.6,8.8,6.3,2.5,3.7,4.9,7.3,6.0,1.4,3.6,6.7
458
+ CEDJ970102,7.6,5.0,4.4,5.2,2.2,4.1,6.2,6.9,2.1,5.1,9.4,5.8,2.1,4.0,5.4,7.2,6.1,1.4,3.2,6.7
459
+ CEDJ970103,8.1,4.6,3.7,3.8,2.0,3.1,4.6,7.0,2.0,6.7,11.0,4.4,2.8,5.6,4.7,7.3,5.6,1.8,3.3,7.7
460
+ CEDJ970104,7.9,4.9,4.0,5.5,1.9,4.4,7.1,7.1,2.1,5.2,8.6,6.7,2.4,3.9,5.3,6.6,5.3,1.2,3.1,6.8
461
+ CEDJ970105,8.3,8.7,3.7,4.7,1.6,4.7,6.5,6.3,2.1,3.7,7.4,7.9,2.3,2.7,6.9,8.8,5.1,0.7,2.4,5.3
462
+ FUKS010101,4.47,8.48,3.89,7.05,0.29,2.87,16.56,8.29,1.74,3.30,5.06,12.98,1.71,2.32,5.41,4.27,3.83,0.67,2.75,4.05
463
+ FUKS010102,6.77,6.87,5.50,8.57,0.31,5.24,12.93,7.95,2.80,2.72,4.43,10.20,1.87,1.92,4.79,5.41,5.36,0.54,2.26,3.57
464
+ FUKS010103,7.43,4.51,9.12,8.71,0.42,5.42,5.86,9.40,1.49,1.76,2.74,9.67,0.60,1.18,5.60,9.60,8.95,1.18,3.26,3.10
465
+ FUKS010104,5.22,7.30,6.06,7.91,1.01,6.00,10.66,5.81,2.27,2.36,4.52,12.68,1.85,1.68,5.70,6.99,5.16,0.56,2.16,4.10
466
+ FUKS010105,9.88,3.71,2.35,3.50,1.12,1.66,4.02,6.88,1.88,10.08,13.21,3.39,2.44,5.27,3.80,4.10,4.98,1.11,4.07,12.53
467
+ FUKS010106,10.98,3.26,2.85,3.37,1.47,2.30,3.51,7.48,2.20,9.74,12.79,2.54,3.10,4.97,3.42,4.93,5.55,1.28,3.55,10.69
468
+ FUKS010107,9.95,3.05,4.84,4.46,1.30,2.64,2.58,8.87,1.99,7.73,9.66,2.00,2.45,5.41,3.20,6.03,5.62,2.60,6.15,9.46
469
+ FUKS010108,8.26,2.80,2.54,2.80,2.67,2.86,2.67,5.62,1.98,8.95,16.46,1.89,2.67,7.32,3.30,6.00,5.00,2.01,3.96,10.24
470
+ FUKS010109,7.39,5.91,3.06,5.14,0.74,2.22,9.80,7.53,1.82,6.96,9.45,7.81,2.10,3.91,4.54,4.18,4.45,0.90,3.46,8.62
471
+ FUKS010110,9.07,4.90,4.05,5.73,0.95,3.63,7.77,7.69,2.47,6.56,9.00,6.01,2.54,3.59,4.04,5.15,5.46,0.95,2.96,7.47
472
+ FUKS010111,8.82,3.71,6.77,6.38,0.90,3.89,4.05,9.11,1.77,5.05,6.54,5.45,1.62,3.51,4.28,7.64,7.12,1.96,4.85,6.60
473
+ FUKS010112,6.65,5.17,4.40,5.50,1.79,4.52,6.89,5.72,2.13,5.47,10.15,7.59,2.24,4.34,4.56,6.52,5.08,1.24,3.01,7.00
474
+ AVBF000101,0.163,0.220,0.124,0.212,0.316,0.274,0.212,0.080,0.315,0.474,0.315,0.255,0.356,0.410,NA,0.290,0.412,0.325,0.354,0.515
475
+ AVBF000102,0.236,0.233,0.189,0.168,0.259,0.314,0.306,-0.170,0.256,0.391,0.293,0.231,0.367,0.328,NA,0.202,0.308,0.197,0.223,0.436
476
+ AVBF000103,-0.490,-0.429,-0.387,-0.375,-0.352,-0.422,-0.382,-0.647,-0.357,-0.268,-0.450,-0.409,-0.375,-0.309,NA,-0.426,-0.240,-0.325,-0.288,-0.220
477
+ AVBF000104,-0.871,-0.727,-0.741,-0.737,-0.666,-0.728,-0.773,-0.822,-0.685,-0.617,-0.798,-0.715,-0.717,-0.649,NA,-0.679,-0.629,-0.669,-0.655,-0.599
478
+ AVBF000105,-0.393,-0.317,-0.268,-0.247,-0.222,-0.291,-0.260,-0.570,-0.244,-0.144,-0.281,-0.294,-0.274,-0.189,NA,-0.280,-0.152,-0.206,-0.155,-0.080
479
+ AVBF000106,-0.378,-0.369,-0.245,-0.113,-0.206,-0.290,-0.165,-0.560,-0.295,-0.134,-0.266,-0.335,-0.260,-0.187,NA,-0.251,-0.093,-0.188,-0.147,-0.084
480
+ AVBF000107,-0.729,-0.535,-0.597,-0.545,-0.408,-0.492,-0.532,-0.860,-0.519,-0.361,-0.462,-0.508,-0.518,-0.454,NA,-0.278,-0.367,-0.455,-0.439,-0.323
481
+ AVBF000108,-0.623,-0.567,-0.619,-0.626,-0.571,-0.559,-0.572,-0.679,-0.508,-0.199,-0.527,-0.581,-0.571,-0.461,NA,-0.458,-0.233,-0.327,-0.451,-0.263
482
+ AVBF000109,-0.376,-0.280,-0.403,-0.405,-0.441,-0.362,-0.362,-0.392,-0.345,-0.194,-0.317,-0.412,-0.312,-0.237,NA,-0.374,-0.243,-0.111,-0.171,-0.355
483
+ YANJ020101,NA,0.62,0.76,0.66,0.83,0.59,0.73,NA,0.92,0.88,0.89,0.77,0.77,0.92,0.94,0.58,0.73,0.86,0.93,0.88
484
+ MITS020101,0,2.45,0,0,0,1.25,1.27,0,1.45,0,0,3.67,0,0,0,0,0,6.93,5.06,0
485
+ TSAJ990101,89.3,190.3,122.4,114.4,102.5,146.9,138.8,63.8,157.5,163.0,163.1,165.1,165.8,190.8,121.6,94.2,119.6,226.4,194.6,138.2
486
+ TSAJ990102,90.0,194.0,124.7,117.3,103.3,149.4,142.2,64.9,160.0,163.9,164.0,167.3,167.0,191.9,122.9,95.4,121.5,228.2,197.0,139.0
487
+ COSI940101,0.0373,0.0959,0.0036,0.1263,0.0829,0.0761,0.0058,0.0050,0.0242,0.0000,0.0000,0.0371,0.0823,0.0946,0.0198,0.0829,0.0941,0.0548,0.0516,0.0057
488
+ PONP930101,0.85,0.20,-0.48,-1.10,2.10,-0.42,-0.79,0,0.22,3.14,1.99,-1.19,1.42,1.69,-1.14,-0.52,-0.08,1.76,1.37,2.53
489
+ WILM950101,0.06,-0.85,0.25,-0.20,0.49,0.31,-0.10,0.21,-2.24,3.48,3.50,-1.62,0.21,4.80,0.71,-0.62,0.65,2.29,1.89,1.59
490
+ WILM950102,2.62,1.26,-1.27,-2.84,0.73,-1.69,-0.45,-1.15,-0.74,4.38,6.57,-2.78,-3.12,9.14,-0.12,-1.39,1.81,5.91,1.39,2.30
491
+ WILM950103,-1.64,-3.28,0.83,0.70,9.30,-0.04,1.18,-1.85,7.17,3.02,0.83,-2.36,4.26,-1.36,3.12,1.59,2.31,2.61,2.37,0.52
492
+ WILM950104,-2.34,1.60,2.81,-0.48,5.03,0.16,1.30,-1.06,-3.00,7.26,1.09,1.56,0.62,2.57,-0.15,1.93,0.19,3.59,-2.58,2.06
493
+ KUHL950101,0.78,1.58,1.20,1.35,0.55,1.19,1.45,0.68,0.99,0.47,0.56,1.10,0.66,0.47,0.69,1.00,1.05,0.70,1.00,0.51
494
+ GUOD860101,25,-7,-7,2,32,0,14,-2,-26,91,100,-26,68,100,25,-2,7,109,56,62
495
+ JURD980101,1.10,-5.10,-3.50,-3.60,2.50,-3.68,-3.20,-0.64,-3.20,4.50,3.80,-4.11,1.90,2.80,-1.90,-0.50,-0.70,-0.46,-1.3,4.2
496
+ BASU050101,0.1366,0.0363,-0.0345,-0.1233,0.2745,0.0325,-0.0484,-0.0464,0.0549,0.4172,0.4251,-0.0101,0.1747,0.4076,0.0019,-0.0433,0.0589,0.2362,0.3167,0.4084
497
+ BASU050102,0.0728,0.0394,-0.0390,-0.0552,0.3557,0.0126,-0.0295,-0.0589,0.0874,0.3805,0.3819,-0.0053,0.1613,0.4201,-0.0492,-0.0282,0.0239,0.4114,0.3113,0.2947
498
+ BASU050103,0.1510,-0.0103,0.0381,0.0047,0.3222,0.0246,-0.0639,0.0248,0.1335,0.4238,0.3926,-0.0158,0.2160,0.3455,0.0844,0.0040,0.1462,0.2657,0.2998,0.3997
499
+ SUYM030101,-0.058,0.000,0.027,0.016,0.447,-0.073,-0.128,0.331,0.195,0.060,0.138,-0.112,0.275,0.240,-0.478,-0.177,-0.163,0.564,0.322,-0.052
500
+ PUNT030101,-0.17,0.37,0.18,0.37,-0.06,0.26,0.15,0.01,-0.02,-0.28,-0.28,0.32,-0.26,-0.41,0.13,0.05,0.02,-0.15,-0.09,-0.17
501
+ PUNT030102,-0.15,0.32,0.22,0.41,-0.15,0.03,0.30,0.08,0.06,-0.29,-0.36,0.24,-0.19,-0.22,0.15,0.16,-0.08,-0.28,-0.03,-0.24
502
+ GEOR030101,0.964,1.143,0.944,0.916,0.778,1.047,1.051,0.835,1.014,0.922,1.085,0.944,1.032,1.119,1.299,0.947,1.017,0.895,1,0.955
503
+ GEOR030102,0.974,1.129,0.988,0.892,0.972,1.092,1.054,0.845,0.949,0.928,1.11,0.946,0.923,1.122,1.362,0.932,1.023,0.879,0.902,0.923
504
+ GEOR030103,0.938,1.137,0.902,0.857,0.6856,0.916,1.139,0.892,1.109,0.986,1,0.952,1.077,1.11,1.266,0.956,1.018,0.971,1.157,0.959
505
+ GEOR030104,1.042,1.069,0.828,0.97,0.5,1.111,0.992,0.743,1.034,0.852,1.193,0.979,0.998,0.981,1.332,0.984,0.992,0.96,1.12,1.001
506
+ GEOR030105,1.065,1.131,0.762,0.836,1.015,0.861,0.736,1.022,0.973,1.189,1.192,0.478,1.369,1.368,1.241,1.097,0.822,1.017,0.836,1.14
507
+ GEOR030106,0.99,1.132,0.873,0.915,0.644,0.999,1.053,0.785,1.054,0.95,1.106,1.003,1.093,1.121,1.314,0.911,0.988,0.939,1.09,0.957
508
+ GEOR030107,0.892,1.154,1.144,0.925,1.035,1.2,1.115,0.917,0.992,0.817,0.994,0.944,0.782,1.058,1.309,0.986,1.11,0.841,0.866,0.9
509
+ GEOR030108,1.092,1.239,0.927,0.919,0.662,1.124,1.199,0.698,1.012,0.912,1.276,1.008,1.171,1.09,0.8,0.886,0.832,0.981,1.075,0.908
510
+ GEOR030109,0.843,1.038,0.956,0.906,0.896,0.968,0.9,0.978,1.05,0.946,0.885,0.893,0.878,1.151,1.816,1.003,1.189,0.852,0.945,0.999
511
+ ZHOH040101,2.18,2.71,1.85,1.75,3.89,2.16,1.89,1.17,2.51,4.50,4.71,2.12,3.63,5.88,2.09,1.66,2.18,6.46,5.01,3.77
512
+ ZHOH040102,1.79,3.20,2.83,2.33,2.22,2.37,2.52,0.70,3.06,4.59,4.72,2.50,3.91,4.84,2.45,1.82,2.45,5.64,4.46,3.67
513
+ ZHOH040103,13.4,8.5,7.6,8.2,22.6,8.5,7.3,7.0,11.3,20.3,20.8,6.1,15.7,23.9,9.9,8.2,10.3,24.5,19.5,19.5
514
+ BAEK050101,0.0166,-0.0762,-0.0786,-0.1278,0.5724,-0.1051,-0.1794,-0.0442,0.1643,0.2758,0.2523,-0.2134,0.0197,0.3561,-0.4188,-0.1629,-0.0701,0.3836,0.2500,0.1782
515
+ HARY940101,90.1,192.8,127.5,117.1,113.2,149.4,140.8,63.8,159.3,164.9,164.6,170.0,167.7,193.5,123.1,94.2,120.0,197.1,231.7,139.1
516
+ PONJ960101,91.5,196.1,138.3,135.2,114.4,156.4,154.6,67.5,163.2,162.6,163.4,162.5,165.9,198.8,123.4,102.0,126.0,209.8,237.2,138.4
517
+ DIGM050101,1.076,1.361,1.056,1.290,0.753,0.729,1.118,1.346,0.985,0.926,1.054,1.105,0.974,0.869,0.820,1.342,0.871,0.666,0.531,1.131
518
+ WOLR790101,1.12,-2.55,-0.83,-0.83,0.59,-0.78,-0.92,1.20,-0.93,1.16,1.18,-0.80,0.55,0.67,0.54,-0.05,-0.02,-0.19,-0.23,1.13
519
+ OLSK800101,1.38,0.00,0.37,0.52,1.43,0.22,0.71,1.34,0.66,2.32,1.47,0.15,1.78,1.72,0.85,0.86,0.89,0.82,0.47,1.99
520
+ KIDA850101,-0.27,1.87,0.81,0.81,-1.05,1.10,1.17,-0.16,0.28,-0.77,-1.10,1.70,-0.73,-1.43,-0.75,0.42,0.63,-1.57,-0.56,-0.40
521
+ GUYH850102,0.05,0.12,0.29,0.41,-0.84,0.46,0.38,0.31,-0.41,-0.69,-0.62,0.57,-0.38,-0.45,0.46,0.12,0.38,-0.98,-0.25,-0.46
522
+ GUYH850103,0.54,-0.16,0.38,0.65,-1.13,0.05,0.38,NA,-0.59,-2.15,-1.08,0.48,-0.97,-1.51,-0.22,0.65,0.27,-1.61,-1.13,-0.75
523
+ GUYH850104,-0.31,1.30,0.49,0.58,-0.87,0.70,0.68,-0.33,0.13,-0.66,-0.53,1.79,-0.38,-0.45,0.34,0.10,0.21,-0.27,0.40,-0.62
524
+ GUYH850105,-0.27,2.00,0.61,0.50,-0.23,1.00,0.33,-0.22,0.37,-0.80,-0.44,1.17,-0.31,-0.55,0.36,0.17,0.18,0.05,0.48,-0.65
525
+ ROSM880104,0.39,NA,-1.91,-0.71,0.25,-1.30,-0.18,0.00,-0.60,1.82,1.82,0.32,0.96,2.27,NA,-1.24,-1.00,2.13,1.47,1.30
526
+ ROSM880105,0.39,-3.95,-1.91,-3.81,0.25,-1.30,-2.91,0.00,-0.64,1.82,1.82,-2.77,0.96,2.27,NA,-1.24,-1.00,2.13,1.47,1.30
527
+ JACR890101,0.18,-5.40,-1.30,-2.36,0.27,-1.22,-2.10,0.09,-1.48,0.37,0.41,-2.53,0.44,0.50,-0.20,-0.40,-0.34,-0.01,-0.08,0.32
528
+ COWR900101,0.42,-1.56,-1.03,-0.51,0.84,-0.96,-0.37,0.00,-2.28,1.81,1.80,-2.03,1.18,1.74,0.86,-0.64,-0.26,1.46,0.51,1.34
529
+ BLAS910101,0.616,0.000,0.236,0.028,0.680,0.251,0.043,0.501,0.165,0.943,0.943,0.283,0.738,1.000,0.711,0.359,0.450,0.878,0.880,0.825
530
+ CASG920101,0.2,-0.7,-0.5,-1.4,1.9,-1.1,-1.3,-0.1,0.4,1.4,0.5,-1.6,0.5,1.0,-1.0,-0.7,-0.4,1.6,0.5,0.7
531
+ CORJ870101,50.76,48.66,45.80,43.17,58.74,46.09,43.48,50.27,49.33,57.30,53.89,42.92,52.75,53.45,45.39,47.24,49.26,53.59,51.79,56.12
532
+ CORJ870102,-0.414,-0.584,-0.916,-1.310,0.162,-0.905,-1.218,-0.684,-0.630,1.237,1.215,-0.670,1.020,1.938,-0.503,-0.563,-0.289,0.514,1.699,0.899
533
+ CORJ870103,-0.96,0.75,-1.94,-5.68,4.54,-5.30,-3.86,-1.28,-0.62,5.54,6.81,-5.62,4.76,5.06,-4.47,-1.92,-3.99,0.21,3.34,5.39
534
+ CORJ870104,-0.26,0.08,-0.46,-1.30,0.83,-0.83,-0.73,-0.40,-0.18,1.10,1.52,-1.01,1.09,1.09,-0.62,-0.55,-0.71,-0.13,0.69,1.15
535
+ CORJ870105,-0.73,-1.03,-5.29,-6.13,0.64,-0.96,-2.90,-2.67,3.03,5.04,4.91,-5.99,3.34,5.20,-4.32,-3.00,-1.91,0.51,2.87,3.98
536
+ CORJ870106,-1.35,-3.89,-10.96,-11.88,4.37,-1.34,-4.56,-5.82,6.54,10.93,9.88,-11.92,7.47,11.35,-10.86,-6.21,-4.83,1.80,7.61,8.20
537
+ CORJ870107,-0.56,-0.26,-2.87,-4.31,1.78,-2.31,-2.35,-1.35,0.81,3.83,4.09,-4.08,3.11,3.67,-3.22,-1.85,-1.97,-0.11,2.17,3.31
538
+ CORJ870108,1.37,1.33,6.29,8.93,-4.47,3.88,4.04,3.39,-1.65,-7.92,-8.68,7.70,-7.13,-7.96,6.25,4.08,4.02,0.79,-4.73,-6.94
539
+ MIYS990101,-0.02,0.44,0.63,0.72,-0.96,0.56,0.74,0.38,0.00,-1.89,-2.29,1.01,-1.36,-2.22,0.47,0.55,0.25,-1.28,-0.88,-1.34
540
+ MIYS990102,0.00,0.07,0.10,0.12,-0.16,0.09,0.12,0.06,0.00,-0.31,-0.37,0.17,-0.22,-0.36,0.08,0.09,0.04,-0.21,-0.14,-0.22
541
+ MIYS990103,-0.03,0.09,0.13,0.17,-0.36,0.13,0.23,0.09,-0.04,-0.33,-0.38,0.32,-0.30,-0.34,0.20,0.10,0.01,-0.24,-0.23,-0.29
542
+ MIYS990104,-0.04,0.07,0.13,0.19,-0.38,0.14,0.23,0.09,-0.04,-0.34,-0.37,0.33,-0.30,-0.38,0.19,0.12,0.03,-0.33,-0.29,-0.29
543
+ MIYS990105,-0.02,0.08,0.10,0.19,-0.32,0.15,0.21,-0.02,-0.02,-0.28,-0.32,0.30,-0.25,-0.33,0.11,0.11,0.05,-0.27,-0.23,-0.23
544
+ ENGD860101,-1.6,12.3,4.8,9.2,-2.0,4.1,8.2,-1.0,3.0,-3.1,-2.8,8.8,-3.4,-3.7,0.2,-0.6,-1.2,-1.9,0.7,-2.6
545
+ FASG890101,-0.21,2.11,0.96,1.36,-6.04,1.52,2.30,0.00,-1.23,-4.81,-4.68,3.88,-3.66,-4.65,0.75,1.74,0.78,-3.32,-1.01,-3.50
546
+ KARS160101,2.00,8.00,5.00,5.00,3.00,6.00,6.00,1.00,7.00,5.00,5.00,6.00,5.00,8.00,4.00,3.00,4.00,11.00,9.00,4.00
547
+ KARS160102,1.00,7.00,4.00,4.00,2.00,5.00,5.00,0.00,6.00,4.00,4.00,5.00,4.00,8.00,4.00,2.00,3.00,12.00,9.00,3.00
548
+ KARS160103,2.00,12.00,8.00,8.00,4.00,10.00,10.00,0.00,14.00,8.00,8.00,10.00,8.00,14.00,8.00,4.00,6.00,24.00,18.00,6.00
549
+ KARS160104,1.00,6.00,4.00,4.00,2.00,4.00,5.00,1.00,6.000,4.00,4.00,4.00,4.00,6.00,4.00,2.00,3.00,8.00,7.00,3.00
550
+ KARS160105,1.00,8.120,5.00,5.17,2.33,5.860,6.00,0.00,6.71,3.25,5.00,7.00,5.40,7.00,4.00,1.670,3.250,11.10,8.88,3.25
551
+ KARS160106,1.00,6.00,3.00,3.00,1.00,4.00,4.00,0.00,6.000,3.00,3.00,5.00,3.00,6.000,4.00,2.00,1.00,9.000,6.000,1.00
552
+ KARS160107,1.00,12.00,6.00,6.00,3.00,8.00,8.00,0.00,9.00,6.00,6.00,9.00,7.00,11.000,4.000,3.00,4.00,14.000,13.000,4.00
553
+ KARS160108,1.00,1.50,1.60,1.60,1.333,1.667,1.667,0.00,2.00,1.600,1.60,1.667,1.60,1.750,2.00,1.333,1.50,2.182,2.000,1.50
554
+ KARS160109,2.00,12.499,11.539,11.539,6.243,12.207,11.530,0.00,12.876,10.851,11.029,10.363,9.49,14.851,12.00,5.00,9.928,13.511,12.868,9.928
555
+ KARS160110,0.00,-4.307,-4.178,-4.178,-2.243,-4.255,-3.425,0.00,-3.721,-6.085,-4.729,-3.151,-2.812,-4.801,-4.00,1.00,-3.928,-6.324,-4.793,-3.928
556
+ KARS160111,1.00,3.500,3.20,3.20,2.00,3.333,3.333,0.00,4.286,1.80,3.20,3.00,2.80,4.25,4.00,2.00,3.00,4.00,4.333,3.00
557
+ KARS160112,2.00,-2.590,0.528,0.528,2.00,-1.043,-0.538,0.00,-1.185,-1.517,1.052,-0.536,0.678,-1.672,4.00,2.00,3.00,-2.576,-2.054,3.00
558
+ KARS160113,6.00,19.00,12.00,12.00,6.00,12.00,12.00,1.00,15.00,12.00,12.00,12.00,18.00,18.00,12.00,6.00,6.00,24.00,18.00,6.00
559
+ KARS160114,6.00,31.444,16.50,16.40,16.670,21.167,21.00,3.50,23.10,15.60,15.60,24.50,27.20,23.25,12.00,13.33,12.40,27.50,27.78,10.50
560
+ KARS160115,6.00,20.00,14.00,12.00,12.00,15.00,14.00,1.00,18.00,12.00,12.00,18.00,18.00,18.00,12.00,8.00,8.00,18.00,20.00,6.00
561
+ KARS160116,6.00,38.00,20.00,20.00,22.00,24.00,26.00,6.00,31.00,18.00,18.00,31.00,34.00,24.00,12.00,20.00,14.00,36.00,38.00,12.00
562
+ KARS160117,12.00,45.00,33.007,34.00,28.00,39.00,40.00,7.00,47.00,30.00,30.00,37.00,40.00,48.00,24.00,22.00,27.00,68.00,56.00,24.007
563
+ KARS160118,6.00,5.00,6.60,6.80,9.33,6.50,6.67,3.50,4.70,6.00,6.00,6.17,8.00,6.00,6.00,7.33,5.40,5.667,6.22,6.00
564
+ KARS160119,12.00,23.343,27.708,28.634,28.00,27.831,28.731,7.00,24.243,24.841,25.021,22.739,31.344,26.993,24.00,20.00,23.819,29.778,28.252,24.00
565
+ KARS160120,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,-1.734,-1.641,0.00,-0.179,0.00,0.00,0.00,0.00,-4.227,0.211,-0.96,0.00
566
+ KARS160121,6.00,10.667,10.00,10.40,11.333,10.50,10.667,3.50,10.400,9.60,9.60,10.167,13.60,12.00,12.00,8.667,9.00,12.75,12.222,9.00
567
+ KARS160122,0.00,4.20,3.00,2.969,6.00,1.849,1.822,0.00,1.605,3.373,3.113,1.372,2.656,2.026,12.00,6.00,6.00,2.044,1.599,6.00
apex/best_key_list ADDED
@@ -0,0 +1,8 @@
1
+ 3&128&2048&1e-05&0.1&1.0
2
+ 3&256&2048&1e-06&0.1&1.0
3
+ 2&128&512&1e-05&0.01&1.0
4
+ 3&128&512&1e-05&0.001&1.0
5
+ 2&128&2048&1e-06&0.0&1.0
6
+ 3&256&512&1e-06&0.0&1.0
7
+ 2&128&2048&1e-05&0.01&1.0
8
+ 2&256&2048&1e-06&0.1&1.0
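Each line in `apex/best_key_list` is an `&`-separated hyperparameter key; `apex/predict.py` below treats it as an opaque string when composing checkpoint filenames. A minimal sketch of that naming scheme (paths relative to the `apex/` directory, as used in `predict.py`):

```python
# Sketch of how apex/predict.py turns these keys into checkpoint paths.
repeat_num = 5  # same value used in apex/predict.py

with open('./best_key_list') as f:
    keys = [line.strip() for line in f if line.strip()]

checkpoint_paths = [
    f'./trained_models/trained_all_model_{key}_ensemble_{i}'
    for key in keys
    for i in range(repeat_num)
]
print(len(checkpoint_paths))  # 8 keys x 5 repeats = 40 ensemble members
```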
apex/predict.py ADDED
@@ -0,0 +1,129 @@
1
+ import os
2
+ import json
3
+ #from time import perf_counter
4
+ import numpy as np
5
+ import matplotlib.pyplot as plt
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ import torch.optim as optim
10
+ import math, copy, time
11
+ from torch.autograd import Variable
12
+ from scipy import stats
13
+ import pandas as pd
14
+ from sklearn.model_selection import KFold
15
+ import pickle
16
+ from sklearn.model_selection import train_test_split
17
+ from torch.optim.lr_scheduler import StepLR
18
+ import os.path
19
+ from Bio import SeqIO
20
+ import string
21
+ import glob
22
+ from sklearn.preprocessing import StandardScaler
23
+ from sklearn.linear_model import ElasticNet
24
+ from sklearn.svm import SVR
25
+ from sklearn.ensemble import RandomForestRegressor
26
+ from sklearn.model_selection import KFold, StratifiedKFold
27
+ from sklearn.metrics import roc_auc_score, average_precision_score
28
+ from sklearn.ensemble import RandomForestClassifier
29
+ from AMP_DL_model_twohead import AMP_model
30
+ #from propy.AAComposition import CalculateAADipeptideComposition
31
+ from rdkit import Chem
32
+ from rdkit.Chem import AllChem
33
+ from scipy import stats
34
+ from utils import *
35
+ from scipy import sparse
36
+ import sys
37
+ from optparse import OptionParser
38
+ import copy
39
+ import pandas as pd
40
+
41
+
42
+ col = ['E. coli ATCC11775', 'P. aeruginosa PAO1', 'P. aeruginosa PA14', 'S. aureus ATCC12600', 'E. coli AIG221', 'E. coli AIG222', 'K. pneumoniae ATCC13883', 'A. baumannii ATCC19606', 'A. muciniphila ATCC BAA-835', 'B. fragilis ATCC25285', 'B. vulgatus ATCC8482', 'C. aerofaciens ATCC25986', 'C. scindens ATCC35704', 'B. thetaiotaomicron ATCC29148', 'B. thetaiotaomicron Complemmented', 'B. thetaiotaomicron Mutant', 'B. uniformis ATCC8492', 'B. eggerthi ATCC27754', 'C. spiroforme ATCC29900', 'P. distasonis ATCC8503', 'P. copri DSMZ18205', 'B. ovatus ATCC8483', 'E. rectale ATCC33656', 'C. symbiosum', 'R. obeum', 'R. torques', 'S. aureus (ATCC BAA-1556) - MRSA', 'vancomycin-resistant E. faecalis ATCC700802', 'vancomycin-resistant E. faecium ATCC700221', 'E. coli Nissle', 'Salmonella enterica ATCC 9150 (BEIRES NR-515)', 'Salmonella enterica (BEIRES NR-170)', 'Salmonella enterica ATCC 9150 (BEIRES NR-174)', 'L. monocytogenes ATCC 19111 (BEIRES NR-106)']
43
+
44
+ max_len = 52 # maximum peptide length
45
+
46
+ word2idx, idx2word = make_vocab()
47
+ emb, AAindex_dict = AAindex('./aaindex1.csv', word2idx)
48
+ vocab_size = len(word2idx)
49
+ emb_size = np.shape(emb)[1]
50
+
51
+
52
+ model_num = 8
53
+ repeat_num = 5
54
+
55
+
56
+ f = open('./best_key_list', 'r')
57
+ lines = f.readlines()
58
+ f.close()
59
+
60
+ model_list = []
61
+ for line in lines:
62
+ parsed = line.strip('\n').strip('\r')
63
+ model_list.append(parsed)
64
+
65
+
66
+ all_list = []
67
+ ensemble_num = model_num * repeat_num
68
+
69
+ deep_model_list = []
70
+ for a_model_name in model_list:
71
+ for a_en in range(repeat_num):
72
+ key = 'trained_all_model_'+a_model_name+'_ensemble_'+str(a_en)
73
+
74
+ model = torch.load('./trained_models/'+key)
75
+ model.eval()
76
+ deep_model_list.append(model)
77
+
78
+
79
+
80
+
81
+
82
+
83
+ seq_list = []
84
+ f = open('./test_seqs.txt', 'r')
85
+ lines = f.readlines()
86
+ f.close()
87
+
88
+ for line in lines:
89
+ seq_list.append(line.strip('\n').strip('\r'))
90
+
91
+ seq_list = np.array(seq_list)
92
+
93
+ ensemble_counter = 0
94
+ for ensemble_id in range(ensemble_num):
95
+
96
+ amp_model = deep_model_list[ensemble_id].cuda().eval()  # renamed to avoid shadowing the imported AMP_model class
97
+
98
+ data_len = len(seq_list)
99
+ batch_size = 3000 #change according to your GPU memory
100
+ for i in range(int(math.ceil(data_len/float(batch_size)))):
101
+ if (i*batch_size) % 1000 == 0:
102
+ print ('progress', i*batch_size, data_len)
103
+
104
+ seq_batch = seq_list[i*batch_size:(i+1)*batch_size]
105
+ seq_rep, _, _ = onehot_encoding(seq_batch, max_len, word2idx)
106
+
107
+ X_seq = torch.LongTensor(seq_rep).cuda()
108
+
109
+
110
+ AMP_pred_batch = amp_model(X_seq).cpu().detach().numpy()
111
+ AMP_pred_batch = 10**(6-AMP_pred_batch) #transform back to MICs
112
+
113
+ if i == 0:
114
+ AMP_pred = AMP_pred_batch
115
+ else:
116
+ AMP_pred = np.vstack([AMP_pred, AMP_pred_batch])
117
+
118
+ if ensemble_id == 0:
119
+ AMP_sum = AMP_pred
120
+ else:
121
+ AMP_sum += AMP_pred
122
+ ensemble_counter += 1
123
+
124
+ AMP_pred = AMP_sum / float(ensemble_counter)
125
+
126
+ df = pd.DataFrame(data=AMP_pred, columns=col, index=seq_list)
127
+ print (df)
128
+
129
+ df.to_csv('Predicted_MICs.csv')
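`predict.py` averages the per-model predictions across all 40 ensemble members and converts the network output back to MIC values via `10**(6 - pred)`, which implies the models output `6 - log10(MIC)`. A minimal sketch of that post-processing step in isolation (array shapes are assumptions, not taken from the repo):

```python
import numpy as np

def ensemble_mic(per_model_preds):
    """Average ensemble outputs and undo the 6 - log10(MIC) transform.

    per_model_preds: list of (num_peptides, num_strains) arrays, one per
    ensemble member, in the model's log-space output (same convention as
    the `10**(6 - AMP_pred_batch)` line in apex/predict.py).
    """
    mics = [10.0 ** (6.0 - p) for p in per_model_preds]
    return np.mean(mics, axis=0)   # MIC estimates, averaged over the ensemble

# toy check: a raw prediction of 4.0 corresponds to MIC = 10**(6 - 4) = 100
print(ensemble_mic([np.array([[4.0]]), np.array([[4.0]])]))  # [[100.]]
```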
apex/requirement.txt ADDED
@@ -0,0 +1,7 @@
1
+ numpy==1.23
2
+ scipy==1.10
3
+ matplotlib==3.9.4
4
+ pandas==2.2.3
5
+ scikit-learn==1.6.1
6
+ biopython==1.85
7
+ rdkit==2024.3.2
apex/test_seqs.txt ADDED
@@ -0,0 +1,10 @@
1
+ IPKTYDKRWDDQCWLAITGRYHGITTPPCCSWVV
2
+ KWLIYYNEGHLMVKYMLTISVRIPEGDNPNIQLHGSIGSR
3
+ VGHAQVASPDLHWDGHGNHLIPWTPCYSHEMNPTMPPA
4
+ RIWETQGSDCIRDGIDSTGPPFMVMFHAAGWRQVHSK
5
+ IYEDYEFVRMPTHMTDFMQSPDQQNPKHMWTLCFDHT
6
+ CPWVQHFWAPPWAHCICIEGPEESGWATIEPMVVGT
7
+ FPLTMHGEFSQNLVWTITQHLVKRWCYTLSPKFCHRY
8
+ SRSEDQILATYWRTSTCYFNQLWFQRLTGQQRICC
9
+ QLELPCCIETWKLNVAFRCPFHKDLKRLGLYSRDKW
10
+ PPMDCVYAIKTTSDHQSTMFIIPRYTHMYGNLQLWCVYCT
apex/utils.py ADDED
@@ -0,0 +1,126 @@
1
+ import os
2
+ import json
3
+ import csv
4
+ import numpy as np
5
+ import matplotlib.pyplot as plt
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ import torch.optim as optim
10
+ import math, copy, time
11
+ from torch.autograd import Variable
12
+ from scipy import stats
13
+ import pandas as pd
14
+ from sklearn.model_selection import KFold
15
+ import pickle
16
+ from sklearn.model_selection import train_test_split
17
+ import os.path
18
+
19
+ def make_vocab():
20
+ #0: pad
21
+ #1: start
22
+ #2: end
23
+
24
+ word2idx = {}
25
+ idx2word = {}
26
+
27
+ word2idx['0'] = 0
28
+ word2idx['1'] = 1
29
+ word2idx['2'] = 2
30
+
31
+ word2idx['A'] = 3
32
+ word2idx['C'] = 4
33
+ word2idx['D'] = 5
34
+ word2idx['E'] = 6
35
+ word2idx['F'] = 7
36
+ word2idx['G'] = 8
37
+ word2idx['H'] = 9
38
+ word2idx['I'] = 10
39
+ word2idx['K'] = 11
40
+ word2idx['L'] = 12
41
+ word2idx['M'] = 13
42
+ word2idx['N'] = 14
43
+ word2idx['P'] = 15
44
+ word2idx['Q'] = 16
45
+ word2idx['R'] = 17
46
+ word2idx['S'] = 18
47
+ word2idx['T'] = 19
48
+ word2idx['V'] = 20
49
+ word2idx['W'] = 21
50
+ word2idx['Y'] = 22
51
+
52
+ for key, value in word2idx.items():
53
+ idx2word[value] = key
54
+
55
+ return word2idx, idx2word
56
+
57
+
58
+ def AAindex(path, word2idx):
59
+ with open(path) as csvfile:
60
+ reader = csv.reader(csvfile)
61
+ AAindex_dict = {}
62
+ AAindex_matrix = []
63
+ skip = 1
64
+ for row in reader:
65
+ if skip == 1:
66
+ skip = 0
67
+ header = np.array(row)[1:].tolist()
68
+ continue
69
+ tmp = []
70
+ for j in np.array(row)[1:]:
71
+ try:
72
+ tmp.append(float(j))
73
+ except:
74
+ tmp.append(0)
75
+ AAindex_matrix.append(np.array(tmp))
76
+
77
+ dim = np.shape(AAindex_matrix)[0]
78
+ AAindex_matrix = np.array(AAindex_matrix)
79
+ for i in range(len(header)):
80
+ AAindex_dict[header[i]] = AAindex_matrix[:, i]
81
+
82
+ #print (AAindex_matrix)
83
+ emb = np.zeros((len(word2idx), dim))
84
+ for key, value in word2idx.items():
85
+ if key in AAindex_dict:
86
+ emb[value] = AAindex_dict[key]
87
+ else:
88
+ pass
89
+ return emb, AAindex_dict
90
+
91
+
92
+
93
+ def onehot_encoding(seq_list_, max_len, word2idx):
94
+ #0: pad
95
+ #1: start
96
+ #2: end
97
+ seq_list = [i for i in seq_list_]
98
+ X = np.zeros((len(seq_list), max_len)).astype(int)
99
+
100
+ AA_mask = []
101
+ nonAA_mask = []
102
+
103
+ for i in range(len(seq_list)):
104
+ if len(seq_list[i]) >= max_len - 2:
105
+ a_seq = '1' + seq_list[i][:max_len-2].upper() + '2'
106
+ else:
107
+ a_seq = '1' + seq_list[i].upper() + '2'
108
+
109
+ if len(a_seq) > max_len:
110
+ iter_num = max_len
111
+ else:
112
+ iter_num = len(a_seq)
113
+
114
+ for j in range(iter_num):
115
+ if a_seq[j] not in word2idx:
116
+ continue
117
+ else:
118
+ X[i,j] = word2idx[a_seq[j]]
119
+
120
+ tmp = np.zeros(max_len)
121
+ tmp[1:iter_num+1] = 1
122
+ AA_mask.append(tmp.astype(int))
123
+ nonAA_mask.append((1-tmp).astype(int))
124
+
125
+
126
+ return np.array(X), np.array(AA_mask), np.array(nonAA_mask)
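`make_vocab` reserves indices 0/1/2 for the pad/start/end tokens, and `onehot_encoding` returns index-encoded sequences padded to `max_len` together with amino-acid masks. A small usage sketch on a toy peptide (assumes `apex/` is on the import path):

```python
from utils import make_vocab, onehot_encoding  # apex/utils.py

word2idx, idx2word = make_vocab()
X, aa_mask, non_aa_mask = onehot_encoding(['GIGKFLHSAK'], 52, word2idx)

print(X.shape)           # (1, 52): start token, residue indices, end token, zero padding
print(X[0, :13])         # [ 1  8 10  8 11  7 12  9 18  3 11  2  0]
print(aa_mask[0].sum())  # number of positions flagged as sequence content by the mask
```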
cfg_dataset.py ADDED
@@ -0,0 +1,324 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from torch.utils.data import Dataset, DataLoader
4
+ import numpy as np
5
+ import json
6
+ import os
7
+ from typing import Dict, List, Tuple, Optional
8
+ import random
9
+
10
+ class CFGUniProtDataset(Dataset):
11
+ """
12
+ Dataset class for UniProt sequences with classifier-free guidance.
13
+
14
+ This dataset:
15
+ 1. Loads processed UniProt data with AMP classifications
16
+ 2. Handles label masking for CFG training
17
+ 3. Integrates with your existing flow training pipeline
18
+ 4. Provides sequences, labels, and masking information
19
+ """
20
+
21
+ def __init__(self,
22
+ data_path: str,
23
+ use_masked_labels: bool = True,
24
+ mask_probability: float = 0.1,
25
+ max_seq_len: int = 50,
26
+ device: str = 'cuda'):
27
+
28
+ self.data_path = data_path
29
+ self.use_masked_labels = use_masked_labels
30
+ self.mask_probability = mask_probability
31
+ self.max_seq_len = max_seq_len
32
+ self.device = device
33
+
34
+ # Load processed data
35
+ self._load_data()
36
+
37
+ # Label mapping
38
+ self.label_map = {
39
+ 0: 'amp', # MIC < 100
40
+ 1: 'non_amp', # MIC >= 100
41
+ 2: 'mask' # Unknown MIC
42
+ }
43
+
44
+ print(f"CFG Dataset initialized:")
45
+ print(f" Total sequences: {len(self.sequences)}")
46
+ print(f" Using masked labels: {use_masked_labels}")
47
+ print(f" Mask probability: {mask_probability}")
48
+ print(f" Label distribution: {self._get_label_distribution()}")
49
+
50
+ def _load_data(self):
51
+ """Load processed UniProt data."""
52
+ if os.path.exists(self.data_path):
53
+ with open(self.data_path, 'r') as f:
54
+ data = json.load(f)
55
+
56
+ self.sequences = data['sequences']
57
+ self.original_labels = np.array(data['original_labels'])
58
+ self.masked_labels = np.array(data['masked_labels'])
59
+ self.mask_indices = set(data['mask_indices'])
60
+
61
+ else:
62
+ raise FileNotFoundError(f"Data file not found: {self.data_path}")
63
+
64
+ def _get_label_distribution(self) -> Dict[str, int]:
65
+ """Get distribution of labels in the dataset."""
66
+ labels = self.masked_labels if self.use_masked_labels else self.original_labels
67
+ unique, counts = np.unique(labels, return_counts=True)
68
+ return {self.label_map[label]: count for label, count in zip(unique, counts)}
69
+
70
+ def __len__(self) -> int:
71
+ return len(self.sequences)
72
+
73
+ def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
74
+ """Get a single sample with sequence and label."""
75
+ sequence = self.sequences[idx]
76
+
77
+ # Get appropriate label
78
+ if self.use_masked_labels:
79
+ label = self.masked_labels[idx]
80
+ else:
81
+ label = self.original_labels[idx]
82
+
83
+ # Check if this sample was masked
84
+ is_masked = idx in self.mask_indices
85
+
86
+ return {
87
+ 'sequence': sequence,
88
+ 'label': torch.tensor(label, dtype=torch.long),
89
+ 'original_label': torch.tensor(self.original_labels[idx], dtype=torch.long),
90
+ 'is_masked': torch.tensor(is_masked, dtype=torch.bool),
91
+ 'index': torch.tensor(idx, dtype=torch.long)
92
+ }
93
+
94
+ def get_label_statistics(self) -> Dict[str, Dict]:
95
+ """Get detailed statistics about labels."""
96
+ stats = {
97
+ 'original': {self.label_map[l]: int(c) for l, c in zip(*np.unique(self.original_labels, return_counts=True))},
98
+ 'masked': {self.label_map[l]: int(c) for l, c in zip(*np.unique(self.masked_labels, return_counts=True))} if self.use_masked_labels else None,
99
+ 'masking_info': {
100
+ 'total_masked': len(self.mask_indices),
101
+ 'mask_probability': self.mask_probability,
102
+ 'masked_indices': list(self.mask_indices)
103
+ }
104
+ }
105
+ return stats
106
+
107
+ class CFGFlowDataset(Dataset):
108
+ """
109
+ Dataset that integrates CFG labels with your existing flow training pipeline.
110
+
111
+ This dataset:
112
+ 1. Loads your existing AMP embeddings
113
+ 2. Adds CFG labels from UniProt processing
114
+ 3. Handles the integration between embeddings and labels
115
+ 4. Provides data in the format expected by your flow training
116
+ """
117
+
118
+ def __init__(self,
119
+ embeddings_path: str,
120
+ cfg_data_path: str,
121
+ use_masked_labels: bool = True,
122
+ max_seq_len: int = 50,
123
+ device: str = 'cuda'):
124
+
125
+ self.embeddings_path = embeddings_path
126
+ self.cfg_data_path = cfg_data_path
127
+ self.use_masked_labels = use_masked_labels
128
+ self.max_seq_len = max_seq_len
129
+ self.device = device
130
+
131
+ # Load data
132
+ self._load_embeddings()
133
+ self._load_cfg_data()
134
+ self._align_data()
135
+
136
+ print(f"CFG Flow Dataset initialized:")
137
+ print(f" AMP embeddings: {self.embeddings.shape}")
138
+ print(f" CFG labels: {len(self.cfg_labels)}")
139
+ print(f" Aligned samples: {len(self.aligned_indices)}")
140
+
141
+ def _load_embeddings(self):
142
+ """Load your existing AMP embeddings."""
143
+ print(f"Loading AMP embeddings from {self.embeddings_path}...")
144
+
145
+ # Try to load the combined embeddings file first (FULL DATA)
146
+ combined_path = os.path.join(self.embeddings_path, "all_peptide_embeddings.pt")
147
+
148
+ if os.path.exists(combined_path):
149
+ print(f"Loading combined embeddings from {combined_path} (FULL DATA)...")
150
+ # Load on CPU first to avoid CUDA issues with DataLoader workers
151
+ self.embeddings = torch.load(combined_path, map_location='cpu')
152
+ print(f"✓ Loaded ALL embeddings: {self.embeddings.shape}")
153
+ else:
154
+ print("Combined embeddings file not found, loading individual files...")
155
+ # Fallback to individual files
156
+ import glob
157
+
158
+ embedding_files = glob.glob(os.path.join(self.embeddings_path, "*.pt"))
159
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json') and not f.endswith('all_peptide_embeddings.pt')]
160
+
161
+ print(f"Found {len(embedding_files)} individual embedding files")
162
+
163
+ # Load and stack all embeddings
164
+ embeddings_list = []
165
+ for file_path in embedding_files:
166
+ try:
167
+ embedding = torch.load(file_path, map_location='cpu')
168
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
169
+ embeddings_list.append(embedding)
170
+ else:
171
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
172
+ except Exception as e:
173
+ print(f"Warning: Could not load {file_path}: {e}")
174
+
175
+ if not embeddings_list:
176
+ raise ValueError("No valid embeddings found!")
177
+
178
+ self.embeddings = torch.stack(embeddings_list)
179
+ print(f"Loaded {len(self.embeddings)} embeddings from individual files")
180
+
181
+ def _load_cfg_data(self):
182
+ """Load CFG data from UniProt processing."""
183
+ print(f"Loading CFG data from {self.cfg_data_path}...")
184
+ with open(self.cfg_data_path, 'r') as f:
185
+ cfg_data = json.load(f)
186
+
187
+ self.cfg_sequences = cfg_data['sequences']
188
+ self.cfg_original_labels = np.array(cfg_data['labels'])
189
+
190
+ # For CFG training, we need to create masked labels
191
+ # Randomly mask 10% of labels for CFG training
192
+ self.cfg_masked_labels = self.cfg_original_labels.copy()
193
+ mask_probability = 0.1
194
+ mask_indices = np.random.choice(
195
+ len(self.cfg_original_labels),
196
+ size=int(len(self.cfg_original_labels) * mask_probability),
197
+ replace=False
198
+ )
199
+ self.cfg_masked_labels[mask_indices] = 2 # 2 = mask/unknown
200
+ self.cfg_mask_indices = set(mask_indices)
201
+
202
+ print(f"Loaded {len(self.cfg_sequences)} CFG sequences")
203
+ print(f"Label distribution: {np.bincount(self.cfg_original_labels)}")
204
+ print(f"Masked {len(self.cfg_mask_indices)} labels for CFG training")
205
+
206
+ def _align_data(self):
207
+ """Align AMP embeddings with CFG data based on sequence matching."""
208
+ print("Aligning AMP embeddings with CFG data...")
209
+
210
+ # For now, we'll use a simple approach: take the first N sequences
211
+ # where N is the minimum of embeddings and CFG data
212
+ min_samples = min(len(self.embeddings), len(self.cfg_sequences))
213
+
214
+ self.aligned_indices = list(range(min_samples))
215
+
216
+ # Align labels
217
+ if self.use_masked_labels:
218
+ self.cfg_labels = self.cfg_masked_labels[:min_samples]
219
+ else:
220
+ self.cfg_labels = self.cfg_original_labels[:min_samples]
221
+
222
+ # Align embeddings
223
+ self.aligned_embeddings = self.embeddings[:min_samples]
224
+
225
+ print(f"Aligned {min_samples} samples")
226
+
227
+ def __len__(self) -> int:
228
+ return len(self.aligned_indices)
229
+
230
+ def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
231
+ """Get a single sample with embedding and CFG label."""
232
+ # Embeddings are already on CPU
233
+ embedding = self.aligned_embeddings[idx]
234
+ label = self.cfg_labels[idx]
235
+ original_label = self.cfg_original_labels[idx]
236
+ is_masked = idx in self.cfg_mask_indices
237
+
238
+ return {
239
+ 'embedding': embedding,
240
+ 'label': torch.tensor(label, dtype=torch.long),
241
+ 'original_label': torch.tensor(original_label, dtype=torch.long),
242
+ 'is_masked': torch.tensor(is_masked, dtype=torch.bool),
243
+ 'index': torch.tensor(idx, dtype=torch.long)
244
+ }
245
+
246
+ def get_embedding_stats(self) -> Dict:
247
+ """Get statistics about the embeddings."""
248
+ return {
249
+ 'shape': self.aligned_embeddings.shape,
250
+ 'mean': self.aligned_embeddings.mean().item(),
251
+ 'std': self.aligned_embeddings.std().item(),
252
+ 'min': self.aligned_embeddings.min().item(),
253
+ 'max': self.aligned_embeddings.max().item()
254
+ }
255
+
256
+ def create_cfg_dataloader(dataset: Dataset,
257
+ batch_size: int = 32,
258
+ shuffle: bool = True,
259
+ num_workers: int = 4) -> DataLoader:
260
+ """Create a DataLoader for CFG training."""
261
+
262
+ def collate_fn(batch):
263
+ """Custom collate function for CFG data."""
264
+ # Separate different types of data
265
+ embeddings = torch.stack([item['embedding'] for item in batch])
266
+ labels = torch.stack([item['label'] for item in batch])
267
+ original_labels = torch.stack([item['original_label'] for item in batch])
268
+ is_masked = torch.stack([item['is_masked'] for item in batch])
269
+ indices = torch.stack([item['index'] for item in batch])
270
+
271
+ return {
272
+ 'embeddings': embeddings,
273
+ 'labels': labels,
274
+ 'original_labels': original_labels,
275
+ 'is_masked': is_masked,
276
+ 'indices': indices
277
+ }
278
+
279
+ return DataLoader(
280
+ dataset,
281
+ batch_size=batch_size,
282
+ shuffle=shuffle,
283
+ num_workers=num_workers,
284
+ collate_fn=collate_fn,
285
+ pin_memory=True
286
+ )
287
+
288
+ def test_cfg_dataset():
289
+ """Test function to verify the CFG dataset works correctly."""
290
+ print("Testing CFG Dataset...")
291
+
292
+ # Test with a small subset
293
+ test_data = {
294
+ 'sequences': ['MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
295
+ 'MKLLIVTFCLTFAAL',
296
+ 'MKLLIVTFCLTFAALMKLLIVTFCLTFAAL'],
297
+ 'original_labels': [0, 1, 0], # amp, non_amp, amp
298
+ 'masked_labels': [0, 2, 0], # amp, mask, amp
299
+ 'mask_indices': [1] # Only second sequence is masked
300
+ }
301
+
302
+ # Save test data
303
+ test_path = 'test_cfg_data.json'
304
+ with open(test_path, 'w') as f:
305
+ json.dump(test_data, f)
306
+
307
+ # Test dataset
308
+ dataset = CFGUniProtDataset(test_path, use_masked_labels=True)
309
+
310
+ print(f"Dataset length: {len(dataset)}")
311
+ for i in range(len(dataset)):
312
+ sample = dataset[i]
313
+ print(f"Sample {i}:")
314
+ print(f" Sequence: {sample['sequence'][:20]}...")
315
+ print(f" Label: {sample['label'].item()}")
316
+ print(f" Original Label: {sample['original_label'].item()}")
317
+ print(f" Is Masked: {sample['is_masked'].item()}")
318
+
319
+ # Clean up
320
+ os.remove(test_path)
321
+ print("Test completed successfully!")
322
+
323
+ if __name__ == "__main__":
324
+ test_cfg_dataset()
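A minimal sketch of wiring `CFGFlowDataset` into training via `create_cfg_dataloader`; the two paths are placeholders for wherever the precomputed ESM-2 embeddings and the processed UniProt JSON actually live:

```python
from cfg_dataset import CFGFlowDataset, create_cfg_dataloader

# Placeholder paths - point these at your own embedding directory and CFG JSON.
dataset = CFGFlowDataset(
    embeddings_path='path/to/peptide_embeddings_dir',
    cfg_data_path='path/to/uniprot_cfg_data.json',
    use_masked_labels=True,   # ~10% of labels are replaced with 2 (= mask) for CFG
)
loader = create_cfg_dataloader(dataset, batch_size=32, shuffle=True, num_workers=4)

batch = next(iter(loader))
print(batch['embeddings'].shape)  # (32, seq_len, 1280) precomputed ESM-2 embeddings
print(batch['labels'][:8])        # 0 = AMP, 1 = non-AMP, 2 = masked/unknown
```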
compressor_with_embeddings.py ADDED
@@ -0,0 +1,278 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.optim as optim
4
+ from torch.utils.data import Dataset, DataLoader
5
+ from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
6
+ import json
7
+ import numpy as np
8
+ from tqdm import tqdm
9
+
10
+ # ---------------- Hyperparameters ----------------
11
+ ESM_DIM = 1280 # ESM-2 hidden dim (esm2_t33_650M_UR50D)
12
+ COMP_RATIO = 16 # compression factor
13
+ COMP_DIM = ESM_DIM // COMP_RATIO
14
+ MAX_SEQ_LEN = 50 # Actual sequence length from final_sequence_encoder.py
15
+ BATCH_SIZE = 32
16
+ EPOCHS = 30
17
+ BASE_LR = 1e-3 # initial learning rate
18
+ LR_MIN = 8e-5 # minimum learning rate for cosine schedule
19
+ WARMUP_STEPS = 10_000
20
+ DEPTH = 4 # total transformer layers (2 pre-pool, 2 post-pool)
21
+ HEADS = 8 # attention heads
22
+ DIM_FF = ESM_DIM * 4
23
+ POOLING = True # enforce ProtFlow hourglass pooling
24
+
25
+ # ---------------- Dataset for Pre-computed Embeddings ----------------
26
+ class PrecomputedEmbeddingDataset(Dataset):
27
+ def __init__(self, embeddings_path):
28
+ """
29
+ Load pre-computed embeddings from the final_sequence_encoder.py output.
30
+ Args:
31
+ embeddings_path: Path to the directory containing individual .pt embedding files
32
+ """
33
+ print(f"Loading pre-computed embeddings from {embeddings_path}...")
34
+
35
+ # Load all individual embedding files
36
+ import glob
37
+ import os
38
+
39
+ embedding_files = glob.glob(os.path.join(embeddings_path, "*.pt"))
40
+ embedding_files = [f for f in embedding_files if not f.endswith('metadata.json') and not f.endswith('sequence_ids.json')]
41
+
42
+ print(f"Found {len(embedding_files)} embedding files")
43
+
44
+ # Load and stack all embeddings
45
+ embeddings_list = []
46
+ for file_path in embedding_files:
47
+ try:
48
+ embedding = torch.load(file_path)
49
+ if embedding.dim() == 2: # (seq_len, hidden_dim)
50
+ embeddings_list.append(embedding)
51
+ else:
52
+ print(f"Warning: Skipping {file_path} - unexpected shape {embedding.shape}")
53
+ except Exception as e:
54
+ print(f"Warning: Could not load {file_path}: {e}")
55
+
56
+ if not embeddings_list:
57
+ raise ValueError("No valid embeddings found!")
58
+
59
+ self.embeddings = torch.stack(embeddings_list)
60
+ print(f"Loaded {len(self.embeddings)} embeddings with shape {self.embeddings.shape}")
61
+
62
+ # Ensure embeddings are the right shape
63
+ if len(self.embeddings.shape) != 3:
64
+ raise ValueError(f"Expected 3D tensor, got shape {self.embeddings.shape}")
65
+
66
+ if self.embeddings.shape[1] != MAX_SEQ_LEN:
67
+ print(f"Warning: Expected sequence length {MAX_SEQ_LEN}, got {self.embeddings.shape[1]}")
68
+
69
+ if self.embeddings.shape[2] != ESM_DIM:
70
+ print(f"Warning: Expected embedding dim {ESM_DIM}, got {self.embeddings.shape[2]}")
71
+
72
+ def __len__(self):
73
+ return len(self.embeddings)
74
+
75
+ def __getitem__(self, idx):
76
+ return self.embeddings[idx]
77
+
78
+ # ---------------- Compressor ----------------
79
+ class Compressor(nn.Module):
80
+ def __init__(self, in_dim=ESM_DIM, out_dim=COMP_DIM):
81
+ super().__init__()
82
+ self.norm = nn.LayerNorm(in_dim)
83
+ layer = lambda: nn.TransformerEncoderLayer(
84
+ d_model=in_dim, nhead=HEADS, dim_feedforward=DIM_FF,
85
+ batch_first=True)
86
+ # two layers before pool, two after
87
+ self.pre_tr = nn.TransformerEncoder(layer(), num_layers=DEPTH//2)
88
+ self.post_tr = nn.TransformerEncoder(layer(), num_layers=DEPTH//2)
89
+ self.proj = nn.Sequential(
90
+ nn.LayerNorm(in_dim),
91
+ nn.Linear(in_dim, out_dim),
92
+ nn.Tanh()
93
+ )
94
+ self.pooling = POOLING
95
+
96
+ def forward(self, x, stats=None):
97
+ if stats:
98
+ m, s, mn, mx = stats['mean'], stats['std'], stats['min'], stats['max']
99
+ # Move stats to the same device as x
100
+ m = m.to(x.device)
101
+ s = s.to(x.device)
102
+ mn = mn.to(x.device)
103
+ mx = mx.to(x.device)
104
+ x = torch.clamp((x - m) / s, -4, 4)
105
+ x = torch.clamp((x - mn) / (mx - mn + 1e-8), 0, 1)
106
+ x = self.norm(x)
107
+ x = self.pre_tr(x) # [B, L, D]
108
+ if self.pooling:
109
+ B, L, D = x.shape
110
+ if L % 2: x = x[:, :-1, :]
111
+ x = x.view(B, L//2, 2, D).mean(2) # halve sequence length
112
+ x = self.post_tr(x) # [B, L' , D]
113
+ return self.proj(x) # [B, L', COMP_DIM]
114
+
115
+ # ---------------- Decompressor ----------------
116
+ class Decompressor(nn.Module):
117
+ def __init__(self, in_dim=COMP_DIM, out_dim=ESM_DIM):
118
+ super().__init__()
119
+ self.proj = nn.Sequential(
120
+ nn.LayerNorm(in_dim),
121
+ nn.Linear(in_dim, out_dim)
122
+ )
123
+ layer = lambda: nn.TransformerEncoderLayer(
124
+ d_model=out_dim, nhead=HEADS, dim_feedforward=DIM_FF,
125
+ batch_first=True)
126
+ self.decoder = nn.TransformerEncoder(layer(), num_layers=DEPTH//2)
127
+ self.pooling = POOLING
128
+
129
+ def forward(self, z):
130
+ x = self.proj(z) # [B, L', D]
131
+ if self.pooling:
132
+ x = x.repeat_interleave(2, dim=1) # unpool to full length
133
+ return self.decoder(x) # [B, L, out_dim]
134
+
135
+ # ---------------- Training Loop ----------------
136
+ def train_with_precomputed_embeddings(embeddings_path, device='cuda'):
137
+ """
138
+ Train compressor using pre-computed embeddings from final_sequence_encoder.py
139
+ """
140
+ # Load dataset
141
+ ds = PrecomputedEmbeddingDataset(embeddings_path)
142
+
143
+ # Compute normalization statistics
144
+ print("Computing normalization statistics...")
145
+ flat = ds.embeddings.view(-1, ESM_DIM)
146
+ stats = {
147
+ 'mean': flat.mean(0),
148
+ 'std': flat.std(0) + 1e-8,
149
+ 'min': torch.clamp((flat - flat.mean(0)) / (flat.std(0) + 1e-8), -4,4).min(0)[0],
150
+ 'max': torch.clamp((flat - flat.mean(0)) / (flat.std(0) + 1e-8), -4,4).max(0)[0]
151
+ }
152
+
153
+ # Save statistics for later use
154
+ torch.save(stats, 'normalization_stats.pt')
155
+ print("Saved normalization statistics to normalization_stats.pt")
156
+
157
+ # Create data loader
158
+ dl = DataLoader(ds, batch_size=BATCH_SIZE, shuffle=True)
159
+
160
+ # Initialize models
161
+ comp = Compressor().to(device)
162
+ decomp = Decompressor().to(device)
163
+
164
+ # Initialize optimizer
165
+ opt = optim.AdamW(list(comp.parameters()) + list(decomp.parameters()), lr=BASE_LR)
166
+
167
+ # LR scheduling: warmup -> cosine
168
+ warmup_sched = LinearLR(opt, start_factor=1e-8, end_factor=1.0, total_iters=WARMUP_STEPS)
169
+ cosine_sched = CosineAnnealingLR(opt, T_max=EPOCHS*len(dl), eta_min=LR_MIN)
170
+ sched = SequentialLR(opt, [warmup_sched, cosine_sched], milestones=[WARMUP_STEPS])
171
+
172
+ print(f"Starting training for {EPOCHS} epochs...")
173
+ print(f"Device: {device}")
174
+ print(f"Batch size: {BATCH_SIZE}")
175
+ print(f"Total batches per epoch: {len(dl)}")
176
+
177
+ # Training loop
178
+ for epoch in range(1, EPOCHS+1):
179
+ total_loss = 0
180
+ comp.train()
181
+ decomp.train()
182
+
183
+ for batch_idx, x in enumerate(tqdm(dl, desc=f"Epoch {epoch}/{EPOCHS}")):
184
+ x = x.to(device)
185
+ z = comp(x, stats)
186
+ xr = decomp(z)
187
+ loss = (x - xr).pow(2).mean()
188
+
189
+ opt.zero_grad()
190
+ loss.backward()
191
+ opt.step()
192
+ sched.step()
193
+
194
+ total_loss += loss.item()
195
+
196
+ # Print progress every 100 batches
197
+ if batch_idx % 100 == 0:
198
+ print(f" Batch {batch_idx}/{len(dl)} - Loss: {loss.item():.6f}")
199
+
200
+ avg_loss = total_loss / len(dl)
201
+ print(f"Epoch {epoch}/{EPOCHS} — Average MSE: {avg_loss:.6f}")
202
+
203
+ # Save checkpoint every 5 epochs
204
+ if epoch % 5 == 0:
205
+ torch.save({
206
+ 'epoch': epoch,
207
+ 'compressor_state_dict': comp.state_dict(),
208
+ 'decompressor_state_dict': decomp.state_dict(),
209
+ 'optimizer_state_dict': opt.state_dict(),
210
+ 'loss': avg_loss,
211
+ }, f'checkpoint_epoch_{epoch}.pth')
212
+
213
+ # Save final models
214
+ torch.save(comp.state_dict(), 'compressor_final.pth')
215
+ torch.save(decomp.state_dict(), 'decompressor_final.pth')
216
+ print("Training completed! Models saved as compressor_final.pth and decompressor_final.pth")
217
+
218
+ # ---------------- Utility Functions ----------------
219
+ def load_and_test_models(compressor_path, decompressor_path, embeddings_path, device='cuda'):
220
+ """
221
+ Load trained models and test reconstruction quality
222
+ """
223
+ print("Loading trained models...")
224
+ comp = Compressor().to(device)
225
+ decomp = Decompressor().to(device)
226
+
227
+ comp.load_state_dict(torch.load(compressor_path))
228
+ decomp.load_state_dict(torch.load(decompressor_path))
229
+
230
+ comp.eval()
231
+ decomp.eval()
232
+
233
+ # Load test data
234
+ ds = PrecomputedEmbeddingDataset(embeddings_path)
235
+ test_loader = DataLoader(ds, batch_size=16, shuffle=False)
236
+
237
+ # Load normalization stats
238
+ stats = torch.load('normalization_stats.pt')
239
+
240
+ print("Testing reconstruction quality...")
241
+ total_mse = 0
242
+ total_samples = 0
243
+
244
+ with torch.no_grad():
245
+ for batch in tqdm(test_loader, desc="Testing"):
246
+ x = batch.to(device)
247
+ z = comp(x, stats)
248
+ xr = decomp(z)
249
+ mse = (x - xr).pow(2).mean()
250
+ total_mse += mse.item() * len(x)
251
+ total_samples += len(x)
252
+
253
+ avg_mse = total_mse / total_samples
254
+ print(f"Average reconstruction MSE: {avg_mse:.6f}")
255
+
256
+ return avg_mse
257
+
258
+ # ---------------- Entrypoint ----------------
259
+ if __name__ == '__main__':
260
+ import argparse
261
+
262
+ parser = argparse.ArgumentParser(description='Train protein compressor with pre-computed embeddings')
263
+ parser.add_argument('--embeddings', type=str, default='/data2/edwardsun/flow_project/compressor_dataset/peptide_embeddings.pt',
264
+ help='Directory of pre-computed .pt embeddings from final_sequence_encoder.py (PrecomputedEmbeddingDataset globs *.pt inside it)')
265
+ parser.add_argument('--device', type=str, default='cuda', help='Device to use (cuda/cpu)')
266
+ parser.add_argument('--test', action='store_true', help='Test existing models instead of training')
267
+
268
+ args = parser.parse_args()
269
+
270
+ device = torch.device(args.device if torch.cuda.is_available() else 'cpu')
271
+ print(f"Using device: {device}")
272
+
273
+ if args.test:
274
+ # Test existing models
275
+ load_and_test_models('compressor_final.pth', 'decompressor_final.pth', args.embeddings, device)
276
+ else:
277
+ # Train new models
278
+ train_with_precomputed_embeddings(args.embeddings, device)
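The hourglass compressor halves the sequence length by mean-pooling adjacent positions and projects each token from 1280 down to 1280/16 = 80 dimensions; the decompressor reverses both steps. A quick shape check, assuming the `normalization_stats.pt` file written by the training loop above:

```python
import torch
from compressor_with_embeddings import Compressor, Decompressor, ESM_DIM, MAX_SEQ_LEN

comp, decomp = Compressor().eval(), Decompressor().eval()
stats = torch.load('normalization_stats.pt', map_location='cpu')  # saved during training

x = torch.randn(2, MAX_SEQ_LEN, ESM_DIM)  # (B, 50, 1280) ESM-2 embeddings
with torch.no_grad():
    z = comp(x, stats)      # (2, 25, 80): length halved, 16x channel compression
    x_rec = decomp(z)       # (2, 50, 1280): back to the original shape
print(z.shape, x_rec.shape)
```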
final_flow_model.py ADDED
@@ -0,0 +1,310 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import math
5
+
6
+ class SinusoidalTimeEmbedding(nn.Module):
7
+ """Sinusoidal time embedding as used in ProtFlow paper."""
8
+
9
+ def __init__(self, dim):
10
+ super().__init__()
11
+ self.dim = dim
12
+
13
+ def forward(self, time):
14
+ device = time.device
15
+ half_dim = self.dim // 2
16
+ embeddings = math.log(10000) / (half_dim - 1)
17
+ embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
18
+ # Ensure time is 2D: [B, 1] and embeddings is 1D: [half_dim]
19
+ if time.dim() > 2:
20
+ time = time.squeeze() # Remove extra dimensions
21
+ embeddings = time.unsqueeze(-1) * embeddings.unsqueeze(0) # [B, half_dim]
22
+ embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1) # [B, dim]
23
+ # Ensure output is exactly 2D
24
+ if embeddings.dim() > 2:
25
+ embeddings = embeddings.squeeze()
26
+ return embeddings
27
+
28
+ class LabelMLP(nn.Module):
29
+ """
30
+ MLP for processing class labels into embeddings.
31
+ This approach processes labels separately from time embeddings.
32
+ """
33
+ def __init__(self, num_classes=3, hidden_dim=480, mlp_dim=256):
34
+ super().__init__()
35
+ self.num_classes = num_classes
36
+
37
+ # MLP to process labels
38
+ self.label_mlp = nn.Sequential(
39
+ nn.Embedding(num_classes, mlp_dim),
40
+ nn.Linear(mlp_dim, mlp_dim),
41
+ nn.GELU(),
42
+ nn.Linear(mlp_dim, hidden_dim),
43
+ nn.GELU(),
44
+ nn.Linear(hidden_dim, hidden_dim)
45
+ )
46
+
47
+ # Initialize embeddings
48
+ nn.init.normal_(self.label_mlp[0].weight, std=0.02)
49
+
50
+ def forward(self, labels):
51
+ """
52
+ Args:
53
+ labels: (B,) tensor of class labels
54
+ - 0: AMP (MIC < 100)
55
+ - 1: Non-AMP (MIC >= 100)
56
+ - 2: Mask (Unknown MIC)
57
+ Returns:
58
+ embeddings: (B, hidden_dim) tensor of processed label embeddings
59
+ """
60
+ return self.label_mlp(labels)
61
+
62
+ class AMPFlowMatcherCFGConcat(nn.Module):
63
+ """
64
+ Flow Matching model with Classifier-Free Guidance using concatenation approach.
65
+ - 12-layer transformer with long skip connections
66
+ - Time embedding + MLP-processed label embedding (concatenated then projected)
67
+ - Optimized for peptide sequences (max length 50)
68
+ """
69
+
70
+ def __init__(self, hidden_dim=480, compressed_dim=30, n_layers=12, n_heads=16,
71
+ dim_ff=3072, dropout=0.1, max_seq_len=25, use_cfg=True):
72
+ super().__init__()
73
+ self.hidden_dim = hidden_dim
74
+ self.compressed_dim = compressed_dim
75
+ self.n_layers = n_layers
76
+ self.max_seq_len = max_seq_len
77
+ self.use_cfg = use_cfg
78
+
79
+ # Time embedding
80
+ self.time_embed = nn.Sequential(
81
+ SinusoidalTimeEmbedding(hidden_dim),
82
+ nn.Linear(hidden_dim, hidden_dim),
83
+ nn.GELU(),
84
+ nn.Linear(hidden_dim, hidden_dim)
85
+ )
86
+
87
+ # CFG components using concatenation approach
88
+ if use_cfg:
89
+ self.label_mlp = LabelMLP(num_classes=3, hidden_dim=hidden_dim)
90
+
91
+ # Projection layer for concatenated time + label embeddings
92
+ self.condition_proj = nn.Sequential(
93
+ nn.Linear(hidden_dim * 2, hidden_dim), # 2 for time + label
94
+ nn.GELU(),
95
+ nn.Linear(hidden_dim, hidden_dim)
96
+ )
97
+
98
+ # Projection layers for compressed space
99
+ self.compress_proj = nn.Linear(compressed_dim, hidden_dim)
100
+ self.decompress_proj = nn.Linear(hidden_dim, compressed_dim)
101
+
102
+ # Positional encoding for peptide sequences
103
+ self.pos_embed = nn.Parameter(torch.randn(1, max_seq_len, hidden_dim))
104
+
105
+ # Transformer layers with long skip connections
106
+ self.layers = nn.ModuleList([
107
+ nn.TransformerEncoderLayer(
108
+ d_model=hidden_dim,
109
+ nhead=n_heads,
110
+ dim_feedforward=dim_ff,
111
+ dropout=dropout,
112
+ activation='gelu',
113
+ batch_first=True
114
+ ) for _ in range(n_layers)
115
+ ])
116
+
117
+ # Long skip connections (U-ViT style)
118
+ self.skip_projections = nn.ModuleList([
119
+ nn.Linear(hidden_dim, hidden_dim) for _ in range(n_layers - 1)
120
+ ])
121
+
122
+ # Output projection
123
+ self.output_proj = nn.Linear(hidden_dim, compressed_dim)
124
+
125
+ def forward(self, x, t, labels=None, mask=None):
126
+ """
127
+ Args:
128
+ x: compressed latent (B, L, compressed_dim) - AMP embeddings
129
+ t: time scalar (B,) or (B, 1)
130
+ labels: class labels (B,) for CFG - 0=AMP, 1=Non-AMP, 2=Mask
131
+ mask: attention mask (B, L) if needed
132
+ """
133
+ B, L, D = x.shape
134
+
135
+ # Project to hidden dimension
136
+ x = self.compress_proj(x) # (B, L, hidden_dim)
137
+
138
+ # Add positional encoding
139
+ if L <= self.max_seq_len:
140
+ x = x + self.pos_embed[:, :L, :]
141
+
142
+ # Time embedding - ensure t is 2D (B, 1)
143
+ if t.dim() == 1:
144
+ t = t.unsqueeze(-1) # (B, 1)
145
+ elif t.dim() > 2:
146
+ t = t.squeeze() # Remove extra dimensions
147
+ if t.dim() == 1:
148
+ t = t.unsqueeze(-1) # (B, 1)
149
+
150
+ t_emb = self.time_embed(t) # (B, hidden_dim)
151
+ # Ensure t_emb is 2D before expanding
152
+ if t_emb.dim() > 2:
153
+ t_emb = t_emb.squeeze() # Remove extra dimensions
154
+ t_emb = t_emb.unsqueeze(1).expand(-1, L, -1) # (B, L, hidden_dim)
155
+
156
+ # CFG: Process label embedding if enabled
157
+ if self.use_cfg and labels is not None:
158
+ # Process labels through MLP
159
+ label_emb = self.label_mlp(labels) # (B, hidden_dim)
160
+ label_emb = label_emb.unsqueeze(1).expand(-1, L, -1) # (B, L, hidden_dim)
161
+
162
+ # Professor's approach: Concatenate time and label embeddings
163
+ combined_emb = torch.cat([t_emb, label_emb], dim=-1) # (B, L, hidden_dim*2)
164
+ projected_emb = self.condition_proj(combined_emb) # (B, L, hidden_dim)
165
+ else:
166
+ projected_emb = t_emb # Just use time embedding if no CFG
167
+
168
+ # Store intermediate representations for skip connections
169
+ skip_features = []
170
+
171
+ # Pass through transformer layers with skip connections
172
+ for i, layer in enumerate(self.layers):
173
+ # Add skip connection from earlier layers
174
+ if i > 0 and i < len(self.layers) - 1:
175
+ skip_feat = skip_features[i-1]
176
+ skip_feat = self.skip_projections[i-1](skip_feat)
177
+ x = x + skip_feat
178
+
179
+ # Store current features for future skip connections
180
+ if i < len(self.layers) - 1:
181
+ skip_features.append(x.clone())
182
+
183
+ # Add projected condition embedding to EACH layer
184
+ x = x + projected_emb
185
+
186
+ # Apply transformer layer
187
+ x = layer(x, src_key_padding_mask=mask)
188
+
189
+ # Project back to compressed dimension
190
+ x = self.output_proj(x) # (B, L, compressed_dim)
191
+
192
+ return x
193
+
194
+ class AMPProtFlowPipelineCFG:
195
+ """
196
+ Complete ProtFlow pipeline for AMP generation with CFG.
197
+ """
198
+
199
+ def __init__(self, compressor, decompressor, flow_model, device='cuda'):
200
+ self.compressor = compressor
201
+ self.decompressor = decompressor
202
+ self.flow_model = flow_model
203
+ self.device = device
204
+
205
+ # Load normalization stats
206
+ self.stats = torch.load('normalization_stats.pt', map_location=device)
207
+
208
+ def generate_amps_cfg(self, num_samples=100, num_steps=25, cfg_scale=7.5,
209
+ condition_label=0):
210
+ """
211
+ Generate AMP samples using CFG.
212
+
213
+ Args:
214
+ num_samples: Number of samples to generate
215
+ num_steps: Number of ODE solving steps
216
+ cfg_scale: CFG guidance scale (higher = stronger conditioning)
217
+ condition_label: 0=AMP, 1=Non-AMP, 2=Mask
218
+ """
219
+ print(f"Generating {num_samples} samples with CFG (label={condition_label}, scale={cfg_scale})...")
220
+
221
+ # Sample random noise
222
+ batch_size = min(num_samples, 32) # Process in batches
223
+ all_samples = []
224
+
225
+ for i in range(0, num_samples, batch_size):
226
+ current_batch = min(batch_size, num_samples - i)
227
+
228
+ # Initialize with noise
229
+ eps = torch.randn(current_batch, self.flow_model.max_seq_len,
230
+ self.flow_model.compressed_dim, device=self.device)
231
+
232
+ # ODE solving steps with CFG
233
+ xt = eps.clone()
234
+ for step in range(num_steps):
235
+ t = torch.ones(current_batch, device=self.device) * (1.0 - step/num_steps)
236
+
237
+ # CFG: Generate with condition and without condition
238
+ if cfg_scale > 0:
239
+ # With condition
240
+ vt_cond = self.flow_model(xt, t,
241
+ labels=torch.full((current_batch,), condition_label,
242
+ device=self.device))
243
+
244
+ # Without condition (mask)
245
+ vt_uncond = self.flow_model(xt, t,
246
+ labels=torch.full((current_batch,), 2,
247
+ device=self.device))
248
+
249
+ # CFG interpolation
250
+ vt = vt_uncond + cfg_scale * (vt_cond - vt_uncond)
251
+ else:
252
+ # No CFG, use mask label
253
+ vt = self.flow_model(xt, t,
254
+ labels=torch.full((current_batch,), 2,
255
+ device=self.device))
256
+
257
+ # Euler step for backward integration (t: 1 -> 0)
258
+ # Use negative dt to integrate backward from noise to data
259
+ dt = -1.0 / num_steps
260
+ xt = xt + vt * dt
261
+
262
+ all_samples.append(xt)
263
+
264
+ # Concatenate all batches
265
+ generated = torch.cat(all_samples, dim=0)
266
+
267
+ # Decompress and decode
268
+ with torch.no_grad():
269
+ # Decompress
270
+ decompressed = self.decompressor(generated)
271
+
272
+ # Apply reverse normalization
273
+ m, s, mn, mx = self.stats['mean'], self.stats['std'], self.stats['min'], self.stats['max']
274
+ decompressed = decompressed * (mx - mn + 1e-8) + mn
275
+ decompressed = decompressed * s + m
276
+
277
+ return generated, decompressed
278
+
279
+ # Example usage
280
+ if __name__ == "__main__":
281
+ # Initialize FINAL AMP flow model with CFG using concatenation approach
282
+ flow_model = AMPFlowMatcherCFGConcat(
283
+ hidden_dim=480,
284
+ compressed_dim=30, # 16x compression of 480
285
+ n_layers=12,
286
+ n_heads=16,
287
+ dim_ff=3072,
288
+ max_seq_len=25, # For AMP sequences (max 50, halved by pooling)
289
+ use_cfg=True
290
+ )
291
+
292
+ print(f"FINAL AMP Flow Model with CFG (Concat+Proj) parameters: {sum(p.numel() for p in flow_model.parameters()):,}")
293
+
294
+ # Test forward pass
295
+ batch_size = 4
296
+ seq_len = 20
297
+ compressed_dim = 30
298
+
299
+ x = torch.randn(batch_size, seq_len, compressed_dim)
300
+ t = torch.rand(batch_size)
301
+ labels = torch.randint(0, 3, (batch_size,)) # Random labels
302
+
303
+ with torch.no_grad():
304
+ output = flow_model(x, t, labels=labels)
305
+ print(f"Input shape: {x.shape}")
306
+ print(f"Output shape: {output.shape}")
307
+ print(f"Time embedding shape: {t.shape}")
308
+ print(f"Labels: {labels}")
309
+
310
+ print("🎯 FINAL AMP Flow Model with CFG (Concat+Proj) ready for training!")
final_sequence_decoder.py ADDED
@@ -0,0 +1,338 @@
1
+ import torch
2
+ import torch.nn.functional as F
3
+ import numpy as np
4
+ import esm
5
+ from tqdm import tqdm
6
+ import os
7
+ from datetime import datetime
8
+
9
+ class EmbeddingToSequenceConverter:
10
+ """
11
+ Convert ESM embeddings back to amino acid sequences using real ESM2 token embeddings.
12
+ """
13
+
14
+ def __init__(self, device='cuda'):
15
+ self.device = device
16
+
17
+ # Load ESM model
18
+ print("Loading ESM model for sequence decoding...")
19
+ self.model, self.alphabet = esm.pretrained.esm2_t33_650M_UR50D()
20
+ self.model = self.model.to(device)
21
+ self.model.eval()
22
+
23
+ # Get vocabulary
24
+ self.vocab = self.alphabet.standard_toks
25
+ self.vocab_list = [token for token in self.vocab if token not in ['<cls>', '<eos>', '<unk>', '<pad>', '<mask>']]
26
+
27
+ # Pre-compute token embeddings for nearest neighbor search
28
+ self._precompute_token_embeddings()
29
+
30
+ print("✓ ESM model loaded for sequence decoding")
31
+
32
+ def _precompute_token_embeddings(self):
33
+ """
34
+ Pre-compute embeddings for all tokens in the vocabulary using real ESM2 embeddings.
35
+ """
36
+ print("Pre-computing token embeddings from ESM2 model...")
37
+
38
+ # Use standard amino acids
39
+ standard_aas = 'ACDEFGHIKLMNPQRSTVWY'
40
+ self.token_list = list(standard_aas)
41
+
42
+ # Extract real embeddings from ESM2 model
43
+ with torch.no_grad():
44
+ # Get token indices for each amino acid
45
+ aa_tokens = []
46
+ for aa in standard_aas:
47
+ try:
48
+ token_idx = self.alphabet.get_idx(aa)
49
+ aa_tokens.append(token_idx)
50
+ except:
51
+ print(f"Warning: Could not find token for amino acid {aa}")
52
+ # Fallback to a default token
53
+ aa_tokens.append(0)
54
+
55
+ # Convert to tensor
56
+ aa_tokens = torch.tensor(aa_tokens, device=self.device)
57
+
58
+ # Extract embeddings from ESM2's embedding layer
59
+ # Note: ESM2 uses a different embedding structure, so we'll use the model's forward pass
60
+ # Create dummy sequences for each amino acid
61
+ dummy_sequences = [(f"aa_{i}", aa) for i, aa in enumerate(standard_aas)]
62
+
63
+ # Get embeddings using the same method as the encoder
64
+ converter = self.alphabet.get_batch_converter()
65
+ _, _, tokens = converter(dummy_sequences)
66
+ tokens = tokens.to(self.device)
67
+
68
+ # Get embeddings from layer 33 (same as encoder)
69
+ with torch.no_grad():
70
+ out = self.model(tokens, repr_layers=[33], return_contacts=False)
71
+ reps = out['representations'][33] # [B, L+2, D]
72
+
73
+ # Extract per-residue embeddings (remove CLS and EOS tokens)
74
+ token_embeddings = []
75
+ for i, (_, seq) in enumerate(dummy_sequences):
76
+ L = len(seq)
77
+ emb = reps[i, 1:1+L, :] # Remove CLS and EOS tokens
78
+ # Take the first position embedding for each amino acid
79
+ token_embeddings.append(emb[0])
80
+
81
+ self.token_embeddings = torch.stack(token_embeddings)
82
+
83
+ print(f"✓ Pre-computed embeddings for {len(self.token_embeddings)} tokens")
84
+ print(f" Embedding shape: {self.token_embeddings.shape}")
85
+
86
+ def embedding_to_sequence(self, embedding, method='diverse', temperature=0.5):
87
+ """
88
+ Convert a single embedding back to amino acid sequence.
89
+
90
+ Args:
91
+ embedding: [seq_len, embed_dim] tensor
92
+ method: 'diverse', 'nearest_neighbor', or 'random'
93
+ temperature: Softmax temperature for sampling (higher = more random/diverse; lower approaches nearest-neighbor decoding)
94
+
95
+ Returns:
96
+ sequence: string of amino acids
97
+ """
98
+ if method == 'diverse':
99
+ return self._diverse_decode(embedding, temperature)
100
+ elif method == 'nearest_neighbor':
101
+ return self._nearest_neighbor_decode(embedding)
102
+ elif method == 'random':
103
+ return self._random_decode(embedding)
104
+ else:
105
+ raise ValueError(f"Unknown method: {method}")
106
+
107
+ def _diverse_decode(self, embedding, temperature=0.5):
108
+ """
109
+ Decode using diverse sampling with temperature control.
110
+ """
111
+ # Ensure both tensors are on the same device
112
+ embedding = embedding.to(self.device)
113
+ token_embeddings = self.token_embeddings.to(self.device)
114
+
115
+ # Compute cosine similarity between embedding and all token embeddings
116
+ embedding_norm = F.normalize(embedding, dim=-1) # [seq_len, embed_dim]
117
+ token_embeddings_norm = F.normalize(token_embeddings, dim=-1) # [vocab_size, embed_dim]
118
+
119
+ # Compute similarities
120
+ similarities = torch.mm(embedding_norm, token_embeddings_norm.t()) # [seq_len, vocab_size]
121
+
122
+ # Apply temperature to increase diversity
123
+ similarities = similarities / temperature
124
+
125
+ # Convert to probabilities
126
+ probs = F.softmax(similarities, dim=-1)
127
+
128
+ # Sample from the distribution
129
+ sampled_indices = torch.multinomial(probs, 1).squeeze(-1)
130
+
131
+ # Convert to sequence
132
+ sequence = ''.join([self.token_list[idx] for idx in sampled_indices.cpu().numpy()])
133
+
134
+ return sequence
135
+
136
+ def _nearest_neighbor_decode(self, embedding):
137
+ """
138
+ Decode using nearest neighbor search in token embedding space.
139
+ """
140
+ # Ensure both tensors are on the same device
141
+ embedding = embedding.to(self.device)
142
+ token_embeddings = self.token_embeddings.to(self.device)
143
+
144
+ # Compute cosine similarity between embedding and all token embeddings
145
+ embedding_norm = F.normalize(embedding, dim=-1) # [seq_len, embed_dim]
146
+ token_embeddings_norm = F.normalize(token_embeddings, dim=-1) # [vocab_size, embed_dim]
147
+
148
+ # Compute similarities
149
+ similarities = torch.mm(embedding_norm, token_embeddings_norm.t()) # [seq_len, vocab_size]
150
+
151
+ # Find nearest neighbors
152
+ nearest_indices = torch.argmax(similarities, dim=-1) # [seq_len]
153
+
154
+ # Convert to sequence
155
+ sequence = ''.join([self.token_list[idx] for idx in nearest_indices.cpu().numpy()])
156
+
157
+ return sequence
158
+
159
+ def _random_decode(self, embedding):
160
+ """
161
+ Decode using random sampling (fallback method).
162
+ """
163
+ seq_len = embedding.shape[0]
164
+ sequence = ''.join(np.random.choice(self.token_list, seq_len))
165
+ return sequence
166
+
167
+ def batch_embedding_to_sequences(self, embeddings, method='diverse', temperature=0.5):
168
+ """
169
+ Convert batch of embeddings to sequences.
170
+
171
+ Args:
172
+ embeddings: [batch_size, seq_len, embed_dim] tensor
173
+ method: decoding method
174
+ temperature: Temperature for diverse sampling
175
+
176
+ Returns:
177
+ sequences: list of strings
178
+ """
179
+ sequences = []
180
+
181
+ for i in tqdm(range(len(embeddings)), desc="Converting embeddings to sequences"):
182
+ embedding = embeddings[i]
183
+ sequence = self.embedding_to_sequence(embedding, method=method, temperature=temperature)
184
+ sequences.append(sequence)
185
+
186
+ return sequences
187
+
188
+ def validate_sequence(self, sequence):
189
+ """
190
+ Validate if a sequence contains valid amino acids.
191
+ """
192
+ valid_aas = set('ACDEFGHIKLMNPQRSTVWY')
193
+ return all(aa in valid_aas for aa in sequence)
194
+
195
+ def filter_valid_sequences(self, sequences):
196
+ """
197
+ Filter out sequences with invalid amino acids.
198
+ """
199
+ valid_sequences = []
200
+ for seq in sequences:
201
+ if self.validate_sequence(seq):
202
+ valid_sequences.append(seq)
203
+ else:
204
+ print(f"Warning: Invalid sequence found: {seq}")
205
+
206
+ return valid_sequences
207
+
208
+ def main():
209
+ """
210
+ Decode all CFG-generated peptide embeddings to sequences and analyze distribution.
211
+ Uses the best trained model (loss: 0.017183, step: 53).
212
+ """
213
+ print("=== CFG-Generated Peptide Sequence Decoder (Best Model) ===")
214
+
215
+ # Initialize converter
216
+ converter = EmbeddingToSequenceConverter()
217
+
218
+ # Get today's date for filename
219
+ today = datetime.now().strftime('%Y%m%d')
220
+
221
+ # Load all CFG-generated embeddings (using best model)
222
+ cfg_files = {
223
+ 'No CFG (0.0)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_no_cfg_{today}.pt',
224
+ 'Weak CFG (3.0)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_weak_cfg_{today}.pt',
225
+ 'Strong CFG (7.5)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_strong_cfg_{today}.pt',
226
+ 'Very Strong CFG (15.0)': f'/data2/edwardsun/generated_samples/generated_amps_best_model_very_strong_cfg_{today}.pt'
227
+ }
228
+
229
+ all_results = {}
230
+
231
+ for cfg_name, file_path in cfg_files.items():
232
+ print(f"\n{'='*50}")
233
+ print(f"Processing {cfg_name}...")
234
+ print(f"Loading: {file_path}")
235
+
236
+ try:
237
+ # Load embeddings
238
+ embeddings = torch.load(file_path, map_location='cpu')
239
+ print(f"✓ Loaded {len(embeddings)} embeddings, shape: {embeddings.shape}")
240
+
241
+ # Decode to sequences using diverse method
242
+ print(f"Decoding sequences...")
243
+ sequences = converter.batch_embedding_to_sequences(embeddings, method='diverse', temperature=0.5)
244
+
245
+ # Filter valid sequences
246
+ valid_sequences = converter.filter_valid_sequences(sequences)
247
+ print(f"✓ Valid sequences: {len(valid_sequences)}/{len(sequences)}")
248
+
249
+ # Store results
250
+ all_results[cfg_name] = {
251
+ 'sequences': valid_sequences,
252
+ 'total': len(sequences),
253
+ 'valid': len(valid_sequences)
254
+ }
255
+
256
+ # Show sample sequences
257
+ print(f"\nSample sequences ({cfg_name}):")
258
+ for i, seq in enumerate(valid_sequences[:5]):
259
+ print(f" {i+1}: {seq}")
260
+
261
+ except Exception as e:
262
+ print(f"❌ Error processing {file_path}: {e}")
263
+ all_results[cfg_name] = {'sequences': [], 'total': 0, 'valid': 0}
264
+
265
+ # Analysis and comparison
266
+ print(f"\n{'='*60}")
267
+ print("CFG ANALYSIS SUMMARY")
268
+ print(f"{'='*60}")
269
+
270
+ for cfg_name, results in all_results.items():
271
+ sequences = results['sequences']
272
+ if sequences:
273
+ # Calculate sequence statistics
274
+ lengths = [len(seq) for seq in sequences]
275
+ avg_length = np.mean(lengths)
276
+ std_length = np.std(lengths)
277
+
278
+ # Calculate amino acid composition
279
+ all_aas = ''.join(sequences)
280
+ aa_counts = {}
281
+ for aa in 'ACDEFGHIKLMNPQRSTVWY':
282
+ aa_counts[aa] = all_aas.count(aa)
283
+
284
+ # Calculate diversity (unique sequences)
285
+ unique_sequences = len(set(sequences))
286
+ diversity_ratio = unique_sequences / len(sequences)
287
+
288
+ print(f"\n{cfg_name}:")
289
+ print(f" Total sequences: {results['total']}")
290
+ print(f" Valid sequences: {results['valid']}")
291
+ print(f" Unique sequences: {unique_sequences}")
292
+ print(f" Diversity ratio: {diversity_ratio:.3f}")
293
+ print(f" Avg length: {avg_length:.1f} ± {std_length:.1f}")
294
+ print(f" Length range: {min(lengths)}-{max(lengths)}")
295
+
296
+ # Show top amino acids
297
+ sorted_aas = sorted(aa_counts.items(), key=lambda x: x[1], reverse=True)
298
+ print(f" Top 5 AAs: {', '.join([f'{aa}({count})' for aa, count in sorted_aas[:5]])}")
299
+
300
+ # Create output directory if it doesn't exist
301
+ output_dir = '/data2/edwardsun/decoded_sequences'
302
+ os.makedirs(output_dir, exist_ok=True)
303
+
304
+ # Save sequences to file with date
305
+ output_file = os.path.join(output_dir, f"decoded_sequences_{cfg_name.lower().replace(' ', '_').replace('(', '').replace(')', '').replace('.', '')}_{today}.txt")
306
+ with open(output_file, 'w') as f:
307
+ f.write(f"# Decoded sequences from {cfg_name}\n")
308
+ f.write(f"# Total: {results['total']}, Valid: {results['valid']}, Unique: {unique_sequences}\n")
309
+ f.write(f"# Generated from best model (loss: 0.017183, step: 53)\n\n")
310
+ for i, seq in enumerate(sequences):
311
+ f.write(f"seq_{i+1:03d}\t{seq}\n")
312
+ print(f" ✓ Saved to: {output_file}")
313
+
314
+ # Overall comparison
315
+ print(f"\n{'='*60}")
316
+ print("OVERALL COMPARISON")
317
+ print(f"{'='*60}")
318
+
319
+ cfg_names = list(all_results.keys())
320
+ valid_counts = [all_results[name]['valid'] for name in cfg_names]
321
+ unique_counts = [len(set(all_results[name]['sequences'])) for name in cfg_names]
322
+
323
+ print(f"Valid sequences: {dict(zip(cfg_names, valid_counts))}")
324
+ print(f"Unique sequences: {dict(zip(cfg_names, unique_counts))}")
325
+
326
+ # Find most diverse and most similar
327
+ if all(valid_counts):
328
+ diversity_ratios = [unique_counts[i]/valid_counts[i] for i in range(len(valid_counts))]
329
+ most_diverse = cfg_names[diversity_ratios.index(max(diversity_ratios))]
330
+ least_diverse = cfg_names[diversity_ratios.index(min(diversity_ratios))]
331
+
332
+ print(f"\nMost diverse: {most_diverse} (ratio: {max(diversity_ratios):.3f})")
333
+ print(f"Least diverse: {least_diverse} (ratio: {min(diversity_ratios):.3f})")
334
+
335
+ print(f"\n✓ Decoding complete! Check the output files for detailed sequences.")
336
+
337
+ if __name__ == "__main__":
338
+ main()
final_sequence_encoder.py ADDED
@@ -0,0 +1,215 @@
1
+ import json
2
+ import os
3
+ import torch
4
+ import torch.nn.functional as F
5
+ import esm
6
+ from tqdm import tqdm
7
+ import numpy as np
8
+
9
+ # ---------------- Configuration ----------------
10
+ DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
11
+ BATCH_SIZE = 32 # increased for GPU efficiency
12
+ MAX_SEQ_LEN = 50 # max sequence length for AMPs
13
+ MIN_SEQ_LEN = 2 # minimum length for filtering
14
+ CANONICAL_AA = set('ACDEFGHIKLMNPQRSTVWY')
15
+
16
+ print(f"Using device: {DEVICE}")
17
+ if torch.cuda.is_available():
18
+ print(f"GPU: {torch.cuda.get_device_name()}")
19
+ print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
20
+
21
+ # ---------------- Sequence Loading ----------------
22
+ def read_peptides_json(json_file):
23
+ """
24
+ Read and filter sequences from the all_peptides_data.json file.
25
+ Extracts sequences from both main peptides and their monomers.
26
+ Filters:
27
+ - Only canonical 20 AAs
28
+ - Sequence length between MIN_SEQ_LEN and MAX_SEQ_LEN
29
+ - Non-empty sequences
30
+ Returns:
31
+ List of (seq_id, sequence) tuples.
32
+ """
33
+ print(f"Loading peptides from {json_file}...")
34
+ with open(json_file, 'r') as f:
35
+ data = json.load(f)
36
+
37
+ seqs = []
38
+ processed_ids = set()
39
+
40
+ for item in tqdm(data, desc="Processing peptides"):
41
+ # Process main peptide sequence
42
+ if 'sequence' in item and item['sequence']:
43
+ seq = item['sequence'].upper().strip()
44
+ if (MIN_SEQ_LEN <= len(seq) <= MAX_SEQ_LEN and
45
+ all(aa in CANONICAL_AA for aa in seq)):
46
+ seq_id = f"main_{item.get('id', 'unk')}"
47
+ if seq_id not in processed_ids:
48
+ seqs.append((seq_id, seq))
49
+ processed_ids.add(seq_id)
50
+
51
+ # Process monomer sequences
52
+ if 'monomers' in item and item['monomers']:
53
+ for monomer in item['monomers']:
54
+ if 'sequence' in monomer and monomer['sequence']:
55
+ seq = monomer['sequence'].upper().strip()
56
+ if (MIN_SEQ_LEN <= len(seq) <= MAX_SEQ_LEN and
57
+ all(aa in CANONICAL_AA for aa in seq)):
58
+ seq_id = f"monomer_{monomer.get('id', 'unk')}"
59
+ if seq_id not in processed_ids:
60
+ seqs.append((seq_id, seq))
61
+ processed_ids.add(seq_id)
62
+
63
+ print(f"Found {len(seqs)} valid sequences")
64
+ return seqs
65
+
66
+ @torch.no_grad()
67
+ def get_per_residue_embeddings(model, alphabet, sequences, batch_size=BATCH_SIZE):
68
+ """
69
+ Compute per-residue ESM-2 embeddings for a list of (id, seq).
70
+ Pads or truncates each embedding to shape [MAX_SEQ_LEN, D].
71
+ Returns a dict {seq_id: tensor[MAX_SEQ_LEN, D]} on CPU.
72
+ """
73
+ model.eval()
74
+ converter = alphabet.get_batch_converter()
75
+ embeddings = {}
76
+
77
+ print(f"Computing embeddings for {len(sequences)} sequences...")
78
+ for i in tqdm(range(0, len(sequences), batch_size), desc="Computing embeddings"):
79
+ batch = sequences[i:i+batch_size]
80
+ labels, seqs = zip(*batch)
81
+ _, _, tokens = converter(batch)
82
+ tokens = tokens.to(DEVICE)
83
+
84
+ out = model(tokens, repr_layers=[33], return_contacts=False)
85
+ reps = out['representations'][33] # [B, L+2, D]
86
+
87
+ for idx, sid in enumerate(labels):
88
+ seq = seqs[idx]
89
+ L = len(seq)
90
+ # take per-residue embeddings and pad/truncate
91
+ emb = reps[idx, 1:1+L, :] # Remove CLS and EOS tokens
92
+ if L < MAX_SEQ_LEN:
93
+ pad_len = MAX_SEQ_LEN - L
94
+ emb = F.pad(emb, (0, 0, 0, pad_len))
95
+ elif L > MAX_SEQ_LEN:
96
+ emb = emb[:MAX_SEQ_LEN, :]
97
+ embeddings[sid] = emb.cpu()
98
+
99
+ return embeddings
100
+
101
+ def save_embeddings_for_compressor(embeddings, output_dir="/data2/edwardsun/flow_project/peptide_embeddings"):
102
+ """
103
+ Save embeddings in a format compatible with the compressor.
104
+ Creates both individual files and a combined tensor.
105
+ """
106
+ os.makedirs(output_dir, exist_ok=True)
107
+
108
+ # Save individual embeddings
109
+ print(f"Saving individual embeddings to {output_dir}/...")
110
+ for seq_id, emb in tqdm(embeddings.items(), desc="Saving individual files"):
111
+ torch.save(emb, os.path.join(output_dir, f"{seq_id}.pt"))
112
+
113
+ # Create and save combined tensor for compressor
114
+ print("Creating combined tensor...")
115
+ all_embeddings = []
116
+ seq_ids = []
117
+
118
+ for seq_id, emb in embeddings.items():
119
+ all_embeddings.append(emb)
120
+ seq_ids.append(seq_id)
121
+
122
+ # Stack all embeddings
123
+ combined_embeddings = torch.stack(all_embeddings) # [N, MAX_SEQ_LEN, D]
124
+
125
+ # Save combined tensor
126
+ combined_path = os.path.join(output_dir, "all_peptide_embeddings.pt")
127
+ torch.save(combined_embeddings, combined_path)
128
+
129
+ # Save sequence IDs for reference
130
+ seq_ids_path = os.path.join(output_dir, "sequence_ids.json")
131
+ with open(seq_ids_path, 'w') as f:
132
+ json.dump(seq_ids, f, indent=2)
133
+
134
+ # Save metadata
135
+ metadata = {
136
+ "num_sequences": len(embeddings),
137
+ "embedding_dim": combined_embeddings.shape[-1],
138
+ "max_seq_len": MAX_SEQ_LEN,
139
+ "device_used": str(DEVICE),
140
+ "model_name": "esm2_t33_650M_UR50D"
141
+ }
142
+ metadata_path = os.path.join(output_dir, "metadata.json")
143
+ with open(metadata_path, 'w') as f:
144
+ json.dump(metadata, f, indent=2)
145
+
146
+ print(f"Saved combined embeddings: {combined_path}")
147
+ print(f"Combined tensor shape: {combined_embeddings.shape}")
148
+ print(f"Memory usage: {combined_embeddings.element_size() * combined_embeddings.nelement() / 1e6:.1f} MB")
149
+
150
+ return combined_path
151
+
152
+ def create_compressor_dataset(embeddings, output_dir="/data2/edwardsun/flow_project/compressor_dataset"):
153
+ """
154
+ Create a dataset format specifically for the compressor training.
155
+ """
156
+ os.makedirs(output_dir, exist_ok=True)
157
+
158
+ # Stack all embeddings
159
+ all_embeddings = torch.stack(list(embeddings.values()))
160
+
161
+ # Save as numpy array for easy loading
162
+ np_path = os.path.join(output_dir, "peptide_embeddings.npy")
163
+ np.save(np_path, all_embeddings.numpy())
164
+
165
+ # Save as torch tensor
166
+ torch_path = os.path.join(output_dir, "peptide_embeddings.pt")
167
+ torch.save(all_embeddings, torch_path)
168
+
169
+ print(f"Created compressor dataset:")
170
+ print(f" Shape: {all_embeddings.shape}")
171
+ print(f" Numpy: {np_path}")
172
+ print(f" Torch: {torch_path}")
173
+
174
+ return torch_path
175
+
176
+ # ---------------- Main Execution ----------------
177
+ if __name__ == '__main__':
178
+ # 1. Load model & tokenizer
179
+ print("Loading ESM-2 model...")
180
+ model_name = 'esm2_t33_650M_UR50D'
181
+ model, alphabet = esm.pretrained.load_model_and_alphabet(model_name)
182
+ model = model.to(DEVICE)
183
+ print(f"Loaded {model_name}")
184
+
185
+ # 2. Read and filter sequences from peptides JSON
186
+ json_file = 'all_peptides_data.json'
187
+ sequences = read_peptides_json(json_file)
188
+ print(f"Loaded {len(sequences)} valid sequences from {json_file}")
189
+
190
+ if len(sequences) == 0:
191
+ print("No valid sequences found. Exiting.")
192
+ exit(1)
193
+
194
+ # 3. Compute per-residue embeddings
195
+ embeddings = get_per_residue_embeddings(model, alphabet, sequences)
196
+
197
+ # 4. Save embeddings in multiple formats
198
+ print("\nSaving embeddings...")
199
+
200
+ # Save individual files and combined tensor
201
+ combined_path = save_embeddings_for_compressor(embeddings)
202
+
203
+ # Create compressor-specific dataset
204
+ compressor_path = create_compressor_dataset(embeddings)
205
+
206
+ print(f"\n✓ Successfully processed {len(embeddings)} peptide sequences")
207
+ print(f"✓ Embeddings saved and ready for compressor training")
208
+ print(f"✓ Use '{compressor_path}' in your compressor.py file")
209
+
210
+ # Show some statistics
211
+ sample_emb = next(iter(embeddings.values()))
212
+ print(f"\nEmbedding statistics:")
213
+ print(f" Individual embedding shape: {sample_emb.shape}")
214
+ print(f" Embedding dimension: {sample_emb.shape[-1]}")
215
+ print(f" Data type: {sample_emb.dtype}")
generate_amps.py ADDED
@@ -0,0 +1,215 @@
1
+ import torch
2
+ import torch.nn.functional as F
3
+ import numpy as np
4
+ from tqdm import tqdm
5
+ import os
6
+ from datetime import datetime
7
+
8
+ # Import your components
9
+ from compressor_with_embeddings import Compressor, Decompressor
10
+ from final_flow_model import AMPFlowMatcherCFGConcat, AMPProtFlowPipelineCFG
11
+
12
+ class AMPGenerator:
13
+ """
14
+ Generate AMP samples using trained ProtFlow model.
15
+ """
16
+
17
+ def __init__(self, model_path, device='cuda'):
18
+ self.device = device
19
+
20
+ # Load models
21
+ self._load_models(model_path)
22
+
23
+ # Load preprocessing statistics
24
+ self.stats = torch.load('normalization_stats.pt', map_location=device)
25
+
26
+ def _load_models(self, model_path):
27
+ """Load trained models."""
28
+ print("Loading trained models...")
29
+
30
+ # Load compressor and decompressor
31
+ self.compressor = Compressor().to(self.device)
32
+ self.decompressor = Decompressor().to(self.device)
33
+
34
+ self.compressor.load_state_dict(torch.load('/data2/edwardsun/flow_amp/models/final_compressor_model.pth', map_location=self.device))
35
+ self.decompressor.load_state_dict(torch.load('/data2/edwardsun/flow_amp/models/final_decompressor_model.pth', map_location=self.device))
36
+
37
+ # Load flow matching model with CFG
38
+ self.flow_model = AMPFlowMatcherCFGConcat(
39
+ hidden_dim=480,
40
+ compressed_dim=80, # 1280 // 16
41
+ n_layers=12,
42
+ n_heads=16,
43
+ dim_ff=3072,
44
+ max_seq_len=25,
45
+ use_cfg=True
46
+ ).to(self.device)
47
+
48
+ checkpoint = torch.load(model_path, map_location=self.device)
49
+
50
+ # Handle PyTorch compilation wrapper
51
+ state_dict = checkpoint['flow_model_state_dict']
52
+ new_state_dict = {}
53
+
54
+ for key, value in state_dict.items():
55
+ # Remove _orig_mod prefix if present
56
+ if key.startswith('_orig_mod.'):
57
+ new_key = key[10:] # Remove '_orig_mod.' prefix
58
+ else:
59
+ new_key = key
60
+ new_state_dict[new_key] = value
61
+
62
+ self.flow_model.load_state_dict(new_state_dict)
63
+
64
+ print(f"✓ All models loaded successfully from step {checkpoint['step']}!")
65
+ print(f" Loss at checkpoint: {checkpoint['loss']:.6f}")
66
+
67
+ def generate_amps(self, num_samples=100, num_steps=25, batch_size=32, cfg_scale=7.5):
68
+ """
69
+ Generate AMP samples using flow matching with CFG.
70
+
71
+ Args:
72
+ num_samples: Number of AMP samples to generate
73
+ num_steps: Number of ODE solving steps (25 for good quality, 1 for reflow)
74
+ batch_size: Batch size for generation
75
+ cfg_scale: CFG guidance scale (higher = stronger conditioning)
76
+ """
77
+ print(f"Generating {num_samples} AMP samples with {num_steps} steps (CFG scale: {cfg_scale})...")
78
+
79
+ self.flow_model.eval()
80
+ self.compressor.eval()
81
+ self.decompressor.eval()
82
+
83
+ all_generated = []
84
+
85
+ with torch.no_grad():
86
+ for i in tqdm(range(0, num_samples, batch_size), desc="Generating"):
87
+ current_batch = min(batch_size, num_samples - i)
88
+
89
+ # Sample random noise
90
+ eps = torch.randn(current_batch, 25, 80, device=self.device) # [B, L', COMP_DIM]
91
+
92
+ # ODE solving steps with CFG
93
+ xt = eps.clone()
94
+ amp_labels = torch.full((current_batch,), 0, device=self.device) # 0 = AMP
95
+ mask_labels = torch.full((current_batch,), 2, device=self.device) # 2 = Mask
96
+
97
+ for step in range(num_steps):
98
+ t = torch.ones(current_batch, device=self.device) * (1.0 - step/num_steps)
99
+
100
+ # CFG: Generate with condition and without condition
101
+ if cfg_scale > 0:
102
+ # With AMP condition
103
+ vt_cond = self.flow_model(xt, t, labels=amp_labels)
104
+
105
+ # Without condition (mask)
106
+ vt_uncond = self.flow_model(xt, t, labels=mask_labels)
107
+
108
+ # CFG interpolation
109
+ vt = vt_uncond + cfg_scale * (vt_cond - vt_uncond)
110
+ else:
111
+ # No CFG, use mask label
112
+ vt = self.flow_model(xt, t, labels=mask_labels)
113
+
114
+ # Euler step for backward integration (t: 1 -> 0)
115
+ # Use negative dt to integrate backward from noise to data
116
+ dt = -1.0 / num_steps
117
+ xt = xt + vt * dt
118
+
119
+ # Decompress to get embeddings
120
+ decompressed = self.decompressor(xt) # [B, L, ESM_DIM]
121
+
122
+ # Apply reverse preprocessing
123
+ m, s, mn, mx = self.stats['mean'], self.stats['std'], self.stats['min'], self.stats['max']
124
+ decompressed = decompressed * (mx - mn + 1e-8) + mn
125
+ decompressed = decompressed * s + m
126
+
127
+ all_generated.append(decompressed.cpu())
128
+
129
+ # Concatenate all batches
130
+ generated_embeddings = torch.cat(all_generated, dim=0)
131
+
132
+ print(f"✓ Generated {generated_embeddings.shape[0]} AMP embeddings")
133
+ print(f" Shape: {generated_embeddings.shape}")
134
+ print(f" Stats - Mean: {generated_embeddings.mean():.4f}, Std: {generated_embeddings.std():.4f}")
135
+
136
+ return generated_embeddings
137
+
138
+ def generate_with_reflow(self, num_samples=100):
139
+ """
140
+ Generate AMP samples using 1-step reflow (if you have reflow model).
141
+ """
142
+ print(f"Generating {num_samples} AMP samples with 1-step reflow...")
143
+
144
+ # This would use the reflow implementation
145
+ # For now, just use 1-step generation
146
+ return self.generate_amps(num_samples=num_samples, num_steps=1, batch_size=32)
147
+
148
+ def main():
149
+ """Main generation function."""
150
+ print("=== AMP Generation Pipeline with CFG ===")
151
+
152
+ # Use the best model from training
153
+ model_path = '/data2/edwardsun/flow_amp/checkpoints/amp_flow_model_best_optimized.pth'
154
+
155
+ # Check if checkpoint exists
156
+ try:
157
+ checkpoint = torch.load(model_path, map_location='cpu')
158
+ print(f"✓ Found best model at step {checkpoint['step']} with loss {checkpoint['loss']:.6f}")
159
+ print(f" Global step: {checkpoint['global_step']}")
160
+ print(f" Total samples: {checkpoint['total_samples']:,}")
161
+ except:
162
+ print(f"❌ Best model not found: {model_path}")
163
+ print("Please train the flow matching model first using amp_flow_training.py")
164
+ return
165
+
166
+ # Initialize generator
167
+ generator = AMPGenerator(model_path, device='cuda')
168
+
169
+ # Generate samples with different CFG scales
170
+ print("\n1. Generating with CFG scale 0.0 (no conditioning)...")
171
+ samples_no_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=0.0)
172
+
173
+ print("\n2. Generating with CFG scale 3.0 (weak conditioning)...")
174
+ samples_weak_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=3.0)
175
+
176
+ print("\n3. Generating with CFG scale 7.5 (strong conditioning)...")
177
+ samples_strong_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=7.5)
178
+
179
+ print("\n4. Generating with CFG scale 15.0 (very strong conditioning)...")
180
+ samples_very_strong_cfg = generator.generate_amps(num_samples=20, num_steps=25, cfg_scale=15.0)
181
+
182
+ # Create output directory if it doesn't exist
183
+ output_dir = '/data2/edwardsun/generated_samples'
184
+ os.makedirs(output_dir, exist_ok=True)
185
+
186
+ # Get today's date for filename
187
+ today = datetime.now().strftime('%Y%m%d')
188
+
189
+ # Save generated samples with date
190
+ torch.save(samples_no_cfg, os.path.join(output_dir, f'generated_amps_best_model_no_cfg_{today}.pt'))
191
+ torch.save(samples_weak_cfg, os.path.join(output_dir, f'generated_amps_best_model_weak_cfg_{today}.pt'))
192
+ torch.save(samples_strong_cfg, os.path.join(output_dir, f'generated_amps_best_model_strong_cfg_{today}.pt'))
193
+ torch.save(samples_very_strong_cfg, os.path.join(output_dir, f'generated_amps_best_model_very_strong_cfg_{today}.pt'))
194
+
195
+ print("\n✓ Generation complete!")
196
+ print(f"Generated samples saved (Date: {today}):")
197
+ print(f" - generated_amps_best_model_no_cfg_{today}.pt (no conditioning)")
198
+ print(f" - generated_amps_best_model_weak_cfg_{today}.pt (weak CFG)")
199
+ print(f" - generated_amps_best_model_strong_cfg_{today}.pt (strong CFG)")
200
+ print(f" - generated_amps_best_model_very_strong_cfg_{today}.pt (very strong CFG)")
201
+
202
+ print("\nCFG Analysis:")
203
+ print(" - CFG scale 0.0: No conditioning, generates diverse sequences")
204
+ print(" - CFG scale 3.0: Weak AMP conditioning")
205
+ print(" - CFG scale 7.5: Strong AMP conditioning (recommended)")
206
+ print(" - CFG scale 15.0: Very strong AMP conditioning (may be too restrictive)")
207
+
208
+ print("\nNext steps:")
209
+ print("1. Decode embeddings back to sequences using ESM-2 decoder")
210
+ print("2. Evaluate AMP properties (antimicrobial activity, toxicity)")
211
+ print("3. Compare sequences generated with different CFG scales")
212
+ print("4. Implement conditioning for specific properties")
213
+
214
+ if __name__ == "__main__":
215
+ main()
launch_full_data_training.sh ADDED
@@ -0,0 +1,118 @@
1
+ #!/bin/bash
2
+
3
+ # Optimized Single GPU AMP Flow Matching Training Launch Script with FULL DATA
4
+ # This script launches optimized training on GPU 3 using ALL available data
5
+ # Features: Mixed precision (BF16), increased batch size, H100 optimizations
6
+
7
+ echo "=== Launching Optimized Single GPU AMP Flow Matching Training with FULL DATA ==="
8
+ echo "Using GPU 3 for training (other GPUs are busy)"
9
+ echo "Using ALL available peptide embeddings and UniProt data"
10
+ echo "OVERNIGHT TRAINING: 6,000 iterations with CFG support and H100 optimizations"
11
+ echo ""
12
+
13
+ # Check if required files exist
14
+ echo "Checking required files..."
15
+ if [ ! -f "final_compressor_model.pth" ]; then
16
+ echo "❌ Missing final_compressor_model.pth"
17
+ echo "Please run compressor_with_embeddings.py first"
18
+ exit 1
19
+ fi
20
+
21
+ if [ ! -f "final_decompressor_model.pth" ]; then
22
+ echo "❌ Missing final_decompressor_model.pth"
23
+ echo "Please run compressor_with_embeddings.py first"
24
+ exit 1
25
+ fi
26
+
27
+ if [ ! -d "/data2/edwardsun/flow_project/peptide_embeddings/" ]; then
28
+ echo "❌ Missing /data2/edwardsun/flow_project/peptide_embeddings/ directory"
29
+ echo "Please run final_sequence_encoder.py first"
30
+ exit 1
31
+ fi
32
+
33
+ # Check for full data files
34
+ if [ ! -f "/data2/edwardsun/flow_project/peptide_embeddings/all_peptide_embeddings.pt" ]; then
35
+ echo "⚠️ Warning: all_peptide_embeddings.pt not found"
36
+ echo "Will use individual embedding files instead"
37
+ else
38
+ echo "✓ Found all_peptide_embeddings.pt (4.3GB - ALL peptide data)"
39
+ fi
40
+
41
+ if [ ! -f "/data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json" ]; then
42
+ echo "❌ Missing /data2/edwardsun/flow_project/test_uniprot_processed/uniprot_processed_data.json"
43
+ echo "This contains ALL UniProt data for CFG training"
44
+ exit 1
45
+ else
46
+ echo "✓ Found uniprot_processed_data.json (3.4GB - ALL UniProt data)"
47
+ fi
48
+
49
+ echo "✓ All required files found!"
50
+ echo ""
51
+
52
+ # Set CUDA device to GPU 3
53
+ export CUDA_VISIBLE_DEVICES=3
54
+
55
+ # Enable H100 optimizations
56
+ export TORCH_CUDNN_V8_API_ENABLED=1
57
+ export TORCH_CUDNN_V8_API_DISABLED=0
58
+
59
+ echo "=== Optimized Training Configuration ==="
60
+ echo " - GPU: 3 (CUDA_VISIBLE_DEVICES=3)"
61
+ echo " - Batch size: 96 (optimized based on profiling)"
62
+ echo " - Total iterations: 6,000"
63
+ echo " - Mixed precision: BF16 (H100 optimized)"
64
+ echo " - Learning rate: 4e-4 -> 2e-4 (cosine annealing)"
65
+ echo " - Warmup steps: 5,000"
66
+ echo " - Gradient clipping: 1.0"
67
+ echo " - Weight decay: 0.01"
68
+ echo " - Data workers: 16"
69
+ echo " - CFG dropout: 15%"
70
+ echo " - Validation: Every 10,000 steps"
71
+ echo " - Checkpoints: Every 1,000 steps"
72
+ echo " - Estimated time: ~8-10 hours (overnight training)"
73
+ echo ""
74
+
75
+ # Check GPU memory and capabilities
76
+ echo "Checking GPU capabilities..."
77
+ nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits | while IFS=, read -r name total free; do
78
+ echo " GPU: $name"
79
+ echo " Total memory: ${total}MB"
80
+ echo " Free memory: ${free}MB"
81
+ echo " Available: $((free * 100 / total))%"
82
+ done
83
+
84
+ echo ""
85
+
86
+ # Launch optimized training
87
+ echo "Starting optimized single GPU training on GPU 3 with FULL DATA..."
88
+ echo ""
89
+
90
+ # Launch training with optional wandb logging
91
+ # Uncomment the following line if you want to use wandb logging:
92
+ # python amp_flow_training_single_gpu_full_data.py --use_wandb
93
+
94
+ # Standard training without wandb
95
+ python amp_flow_training_single_gpu_full_data.py
96
+
97
+ echo ""
98
+ echo "=== Optimized Overnight Training Complete with FULL DATA ==="
99
+ echo "Check for output files:"
100
+ echo " - amp_flow_model_best_optimized.pth (best validation model)"
101
+ echo " - amp_flow_model_final_optimized.pth (final model)"
102
+ echo " - amp_flow_checkpoint_optimized_step_*.pth (checkpoints every 1,000 steps)"
103
+ echo ""
104
+ echo "Training optimizations applied:"
105
+ echo " ✓ Mixed precision (BF16) for ~30-50% speedup"
106
+ echo " ✓ Increased batch size (96) for better H100 utilization"
107
+ echo " ✓ Optimized learning rate schedule with proper warmup"
108
+ echo " ✓ Gradient clipping for training stability"
109
+ echo " ✓ CFG dropout for better guidance"
110
+ echo " ✓ Validation monitoring and early stopping"
111
+ echo " ✓ PyTorch 2.x compilation for speedup"
112
+ echo ""
113
+ echo "Next steps:"
114
+ echo "1. Test the optimized model: python generate_amps.py"
115
+ echo "2. Compare performance with previous model"
116
+ echo "3. Implement reflow for 1-step generation"
117
+ echo "4. Add conditioning for toxicity"
118
+ echo "5. Fine-tune on specific AMP properties"
launch_multi_gpu_training.sh ADDED
@@ -0,0 +1,85 @@
1
+ #!/bin/bash
2
+
3
+ # Multi-GPU AMP Flow Matching Training Launch Script
4
+ # This script launches distributed training across 4 H100 GPUs
5
+
6
+ echo "=== Launching Multi-GPU AMP Flow Matching Training with FULL DATA ==="
7
+ echo "Using 4 H100 GPUs for distributed training"
8
+ echo "Using ALL available peptide embeddings and UniProt data"
9
+ echo "EXTENDED TRAINING: 5000 iterations with CFG support"
10
+ echo ""
11
+
12
+ # Check if required files exist
13
+ echo "Checking required files..."
14
+ if [ ! -f "final_compressor_model.pth" ]; then
15
+ echo "❌ Missing final_compressor_model.pth"
16
+ echo "Please run compressor_with_embeddings.py first"
17
+ exit 1
18
+ fi
19
+
20
+ if [ ! -f "final_decompressor_model.pth" ]; then
21
+ echo "❌ Missing final_decompressor_model.pth"
22
+ echo "Please run compressor_with_embeddings.py first"
23
+ exit 1
24
+ fi
25
+
26
+ if [ ! -d "/data2/edwardsun/flow_project/peptide_embeddings/" ]; then
27
+ echo "❌ Missing /data2/edwardsun/flow_project/peptide_embeddings/ directory"
28
+ echo "Please run final_sequence_encoder.py first"
29
+ exit 1
30
+ fi
31
+
32
+ # Check for full data files
33
+ if [ ! -f "/data2/edwardsun/flow_project/peptide_embeddings/all_peptide_embeddings.pt" ]; then
34
+ echo "⚠️ Warning: all_peptide_embeddings.pt not found"
35
+ echo "Will use individual embedding files instead"
36
+ else
37
+ echo "✓ Found all_peptide_embeddings.pt (4.3GB - ALL peptide data)"
38
+ fi
39
+
40
+ # Check if there are embedding files in the directory (fallback)
41
+ if [ ! "$(ls -A /data2/edwardsun/flow_project/peptide_embeddings/*.pt 2>/dev/null)" ]; then
42
+ echo "❌ No .pt files found in /data2/edwardsun/flow_project/peptide_embeddings/ directory"
43
+ echo "Please run final_sequence_encoder.py first"
44
+ exit 1
45
+ fi
46
+
47
+ echo "✓ All required files found!"
48
+ echo ""
49
+
50
+ # Set environment variables for distributed training
51
+ export NCCL_DEBUG=INFO
52
+ export NCCL_IB_DISABLE=0
53
+ export NCCL_P2P_DISABLE=0
54
+
55
+ # Launch distributed training
56
+ echo "Starting distributed training with torchrun..."
57
+ echo "Configuration (FULL DATA TRAINING):"
58
+ echo " - Number of GPUs: 4"
59
+ echo " - Batch size per GPU: 64"
60
+ echo " - Total batch size: 256"
61
+ echo " - Total iterations: 5,000"
62
+ echo " - Data: ALL peptide embeddings + ALL UniProt data"
63
+ echo " - Estimated time: ~30-45 minutes (4x faster than single GPU)"
64
+ echo ""
65
+
66
+ # Launch with torchrun
67
+ torchrun \
68
+ --nproc_per_node=4 \
69
+ --nnodes=1 \
70
+ --node_rank=0 \
71
+ --master_addr=localhost \
72
+ --master_port=29500 \
73
+ amp_flow_training_multi_gpu.py
74
+
75
+ echo ""
76
+ echo "=== Training Complete with FULL DATA ==="
77
+ echo "Check for output files:"
78
+ echo " - amp_flow_model_final_full_data.pth (final model with full data)"
79
+ echo " - amp_flow_checkpoint_full_data_step_*.pth (checkpoints)"
80
+ echo ""
81
+ echo "Next steps:"
82
+ echo "1. Test the model: python generate_amps.py"
83
+ echo "2. If successful, increase iterations for full training"
84
+ echo "3. Implement reflow for 1-step generation"
85
+ echo "4. Add conditioning for toxicity"
model_card.md ADDED
@@ -0,0 +1,127 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - protein-design
6
+ - antimicrobial-peptides
7
+ - flow-matching
8
+ - esm-2
9
+ - pytorch
10
+ license: mit
11
+ datasets:
12
+ - uniprot
13
+ - amp-datasets
14
+ metrics:
15
+ - mic-prediction
16
+ - sequence-validity
17
+ - diversity
18
+ ---
19
+
20
+ # FlowAMP: Flow-based Antimicrobial Peptide Generation
21
+
22
+ ## Model Description
23
+
24
+ FlowAMP is a novel flow-based generative model for designing antimicrobial peptides (AMPs) using conditional flow matching and ESM-2 protein language model embeddings. The model leverages the power of flow matching for high-quality peptide generation while incorporating protein language model understanding for biologically relevant sequences.
25
+
26
+ ### Architecture
27
+
28
+ The model consists of several key components:
29
+
30
+ 1. **ESM-2 Encoder**: Uses ESM-2 (esm2_t33_650M_UR50D) to extract 1280-dimensional protein sequence embeddings
31
+ 2. **Compressor/Decompressor**: Reduces embedding dimensionality by 16x (1280 → 80) for efficient processing
32
+ 3. **Flow Matcher**: Implements conditional flow matching for generation with time embeddings
33
+ 4. **CFG Integration**: Classifier-free guidance for controllable generation
34
+
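+ A rough sketch of the tensor shapes through the pipeline (a summary of the components above, with dimension values taken from the generation scripts, not an exact API):
+
+ ```python
+ # peptide sequence (up to 50 residues)
+ #   -> ESM-2 layer-33 embeddings        [50, 1280]
+ #   -> compressor                       [25, 80]   (length halved by pooling, 16x channel reduction)
+ #   -> conditional flow matching (CFG)  velocity field over the compressed latent
+ #   -> decompressor                     [50, 1280]
+ #   -> nearest-token / diverse decoding amino-acid string
+ ```
+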
35
+ ### Key Features
36
+
37
+ - **Flow-based Generation**: Uses conditional flow matching for high-quality peptide generation
38
+ - **ESM-2 Integration**: Leverages ESM-2 protein language model embeddings for sequence understanding
39
+ - **CFG Training**: Implements Classifier-Free Guidance for controllable generation
40
+ - **Multi-GPU Training**: Optimized for H100 GPUs with mixed precision training
41
+ - **Comprehensive Evaluation**: MIC prediction and antimicrobial activity assessment
42
+
43
+ ## Training
44
+
45
+ ### Training Data
46
+
47
+ The model was trained on:
48
+ - **UniProt Database**: Comprehensive protein sequence database
49
+ - **AMP Datasets**: Curated antimicrobial peptide sequences
50
+ - **ESM-2 Embeddings**: Pre-computed embeddings for efficient training
51
+
52
+ ### Training Configuration
53
+
54
+ - **Batch Size**: 96 (optimized for H100)
55
+ - **Learning Rate**: 4e-4 with cosine annealing to 2e-4
56
+ - **Training Iterations**: 6,000
57
+ - **Mixed Precision**: BF16 for H100 optimization
58
+ - **CFG Dropout**: 15% for unconditional training
59
+ - **Gradient Clipping**: Norm=1.0 for stability
60
+
61
+ ### Training Performance
62
+
63
+ - **Speed**: 31 steps/second on H100 GPU
64
+ - **Memory Efficiency**: Mixed precision training
65
+ - **Stability**: Gradient clipping and weight decay (0.01)
66
+
67
+ ## Usage
68
+
69
+ ### Basic Generation
70
+
71
+ ```python
+ from generate_amps import AMPGenerator
+
+ # Load the trained pipeline (flow model, compressor/decompressor, normalization stats)
+ generator = AMPGenerator('path/to/checkpoint.pth', device='cuda')
+
+ # Generate AMP embeddings at different CFG guidance scales
+ samples_no_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=0.0)
+ samples_weak_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=3.0)
+ samples_strong_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=7.5)
+ samples_very_strong_cfg = generator.generate_amps(num_samples=100, num_steps=25, cfg_scale=15.0)
+ ```
84
+
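+ The generator returns ESM-2 embeddings rather than amino-acid strings; `EmbeddingToSequenceConverter` from `final_sequence_decoder.py` converts them back to sequences, for example:
+
+ ```python
+ from final_sequence_decoder import EmbeddingToSequenceConverter
+
+ converter = EmbeddingToSequenceConverter(device='cuda')
+ sequences = converter.batch_embedding_to_sequences(samples_strong_cfg, method='diverse', temperature=0.5)
+ valid_sequences = converter.filter_valid_sequences(sequences)
+ ```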
85
+ ### Evaluation
86
+
87
+ ```python
+ from test_generated_peptides import PeptideTester
+
+ # Generate peptides and score them with the local APEX MIC predictor
+ tester = PeptideTester(model_path='path/to/checkpoint.pth', device='cuda')
+ results = tester.run_full_pipeline(num_samples=100)
+ ```
93
+
94
+ ## Performance
95
+
96
+ ### Generation Quality
97
+
98
+ - **Sequence Validity**: High percentage of valid peptide sequences
99
+ - **Diversity**: Good sequence diversity across different CFG strengths
100
+ - **Biological Relevance**: ESM-2 embeddings ensure biologically meaningful sequences
101
+
102
+ ### Antimicrobial Activity
103
+
104
+ - **MIC Prediction**: Integration with Apex model for MIC prediction
105
+ - **Activity Assessment**: Comprehensive evaluation of antimicrobial potential
106
+ - **CFG Effectiveness**: Measured through controlled generation
107
+
108
+ ## Limitations
109
+
110
+ - **Sequence Length**: Limited to 50 amino acids maximum
111
+ - **Computational Requirements**: Requires GPU for efficient generation
112
+ - **Training Data**: Dependent on quality of UniProt and AMP datasets
113
+
114
+ ## Citation
115
+
116
+ ```bibtex
117
+ @article{flowamp2024,
118
+ title={FlowAMP: Flow-based Antimicrobial Peptide Generation with Conditional Flow Matching},
119
+ author={Sun, Edward},
120
+ journal={arXiv preprint},
121
+ year={2024}
122
+ }
123
+ ```
124
+
125
+ ## License
126
+
127
+ MIT License - see LICENSE file for details.
monitor_training.sh ADDED
@@ -0,0 +1,53 @@
1
+ #!/bin/bash
2
+
3
+ echo "=== AMP Flow Training Monitor ==="
4
+ echo "Timestamp: $(date)"
5
+ echo ""
6
+
7
+ # Check if training process is running
8
+ echo "1. Process Status:"
9
+ if pgrep -f "amp_flow_training_single_gpu_full_data.py" > /dev/null; then
10
+ echo "✓ Training process is running"
11
+ PID=$(pgrep -f "amp_flow_training_single_gpu_full_data.py")
12
+ echo " PID: $PID"
13
+ echo " Runtime: $(ps -o etime= -p $PID)"
14
+ else
15
+ echo "❌ Training process not found"
16
+ exit 1
17
+ fi
18
+
19
+ echo ""
20
+
21
+ # Check GPU usage
22
+ echo "2. GPU Usage:"
23
+ nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | while IFS=, read -r idx name util mem_used mem_total; do
24
+ echo " GPU $idx ($name): $util% | ${mem_used}MB/${mem_total}MB"
25
+ done
26
+
27
+ echo ""
28
+
29
+ # Check log file
30
+ echo "3. Recent Log Output:"
31
+ if [ -f "overnight_training.log" ]; then
32
+ echo " Log file size: $(du -h overnight_training.log | cut -f1)"
33
+ echo " Last 5 lines:"
34
+ tail -5 overnight_training.log | sed 's/^/ /'
35
+ else
36
+ echo " ❌ Log file not found"
37
+ fi
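+ # Note: overnight_training.log only exists if the launcher was started with its output redirected,
+ # e.g. (illustrative): nohup bash launch_full_data_training.sh > overnight_training.log 2>&1 &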
38
+
39
+ echo ""
40
+
41
+ # Check for checkpoint files
42
+ echo "4. Checkpoint Files:"
43
+ if [ -d "/data2/edwardsun/flow_checkpoints" ]; then
44
+ echo " Checkpoint directory: /data2/edwardsun/flow_checkpoints"
45
+ ls -la /data2/edwardsun/flow_checkpoints/*.pth 2>/dev/null | wc -l | xargs echo " Number of checkpoints:"
46
+ echo " Latest checkpoint:"
47
+ ls -t /data2/edwardsun/flow_checkpoints/*.pth 2>/dev/null | head -1 | xargs -I {} basename {} 2>/dev/null || echo " None yet"
48
+ else
49
+ echo " ❌ Checkpoint directory not found"
50
+ fi
51
+
52
+ echo ""
53
+ echo "=== End Monitor ==="
normalization_stats.pt ADDED
Binary file (12.3 kB).
requirements.txt ADDED
@@ -0,0 +1,9 @@
1
+ torch>=2.0.0
+ fair-esm>=2.0.0
2
+ transformers>=4.20.0
3
+ numpy>=1.21.0
4
+ tqdm>=4.64.0
5
+ wandb>=0.13.0
6
+ pandas>=1.5.0
7
+ scikit-learn>=1.1.0
8
+ matplotlib>=3.5.0
9
+ seaborn>=0.11.0
requirements.yaml ADDED
@@ -0,0 +1,31 @@
1
+ name: mdlm
2
+ channels:
3
+ - pytorch
4
+ - conda-forge
5
+ - defaults
6
+ dependencies:
7
+ - python=3.9
8
+ - pytorch=2.1.0
9
+ - torchvision
10
+ - torchaudio
11
+ - pytorch-cuda=11.8
12
+ - cudatoolkit=11.8
13
+ - pip
14
+ - pip:
15
+ - fair-esm
16
+ - transformers
17
+ - datasets
18
+ - accelerate
19
+ - wandb
20
+ - tqdm
21
+ - numpy
22
+ - scipy
23
+ - scikit-learn
24
+ - matplotlib
25
+ - seaborn
26
+ - pandas
27
+ - biopython
28
+ - h5py
29
+ - tensorboard
30
+ - jupyter
31
+ - ipykernel
test_generated_peptides.py ADDED
@@ -0,0 +1,383 @@
1
+ import torch
2
+ import numpy as np
3
+ import json
4
+ import os
5
+ from tqdm import tqdm
6
+ import warnings
7
+ from datetime import datetime
8
+ warnings.filterwarnings('ignore')
9
+
10
+ # Import our components
11
+ from generate_amps import AMPGenerator
12
+ from compressor_with_embeddings import Compressor, Decompressor
13
+ from final_sequence_decoder import EmbeddingToSequenceConverter
14
+
15
+ # Import local APEX wrapper
16
+ try:
17
+ from local_apex_wrapper import LocalAPEXWrapper
18
+ APEX_AVAILABLE = True
19
+ except ImportError as e:
20
+ print(f"Warning: Local APEX not available: {e}")
21
+ APEX_AVAILABLE = False
22
+
23
+ class PeptideTester:
24
+ """
25
+ Generate peptides and test them using APEX for antimicrobial activity.
26
+ """
27
+
28
+ def __init__(self, model_path='amp_flow_model_final.pth', device='cuda'):
29
+ self.device = device
30
+ self.model_path = model_path
31
+
32
+ # Initialize generator
33
+ print("Initializing peptide generator...")
34
+ self.generator = AMPGenerator(model_path, device)
35
+
36
+ # Initialize embedding to sequence converter
37
+ print("Initializing embedding to sequence converter...")
38
+ self.converter = EmbeddingToSequenceConverter(device)
39
+
40
+ # Initialize APEX if available
41
+ if APEX_AVAILABLE:
42
+ print("Initializing local APEX predictor...")
43
+ self.apex = LocalAPEXWrapper()
44
+ print("✓ Local APEX loaded successfully!")
45
+ else:
46
+ self.apex = None
47
+ print("⚠ Local APEX not available - will only generate sequences")
48
+
49
+ def generate_peptides(self, num_samples=100, num_steps=25, batch_size=32):
50
+ """
51
+ Generate peptide sequences using the trained flow model.
52
+ """
53
+ print(f"\n=== Generating {num_samples} Peptide Sequences ===")
54
+
55
+ # Generate embeddings
56
+ generated_embeddings = self.generator.generate_amps(
57
+ num_samples=num_samples,
58
+ num_steps=num_steps,
59
+ batch_size=batch_size
60
+ )
61
+
62
+ print(f"Generated embeddings shape: {generated_embeddings.shape}")
63
+
64
+ # Convert embeddings to sequences using the converter
65
+ sequences = self.converter.batch_embedding_to_sequences(generated_embeddings)
66
+
67
+ # Filter valid sequences
68
+ sequences = self.converter.filter_valid_sequences(sequences)
69
+
70
+ return sequences
71
+
72
+
73
+
74
+ def test_with_apex(self, sequences):
75
+ """
76
+ Test generated sequences using APEX for antimicrobial activity.
77
+ """
78
+ if not APEX_AVAILABLE:
79
+ print("⚠ APEX not available - skipping activity prediction")
80
+ return None
81
+
82
+ print(f"\n=== Testing {len(sequences)} Sequences with APEX ===")
83
+
84
+ results = []
85
+
86
+ for i, seq in tqdm(enumerate(sequences), desc="Testing with APEX"):
87
+ try:
88
+ # Predict antimicrobial activity using local APEX
89
+ avg_mic = self.apex.predict_single(seq)
90
+ is_amp = self.apex.is_amp(seq, threshold=32.0) # MIC threshold
91
+
92
+ result = {
93
+ 'sequence': seq,
94
+ 'sequence_id': f'generated_{i:04d}',
95
+ 'apex_score': avg_mic, # Lower MIC = better activity
96
+ 'is_amp': is_amp,
97
+ 'length': len(seq)
98
+ }
99
+ results.append(result)
100
+
101
+ except Exception as e:
102
+ print(f"Error testing sequence {i}: {e}")
103
+ continue
104
+
105
+ return results
106
+
107
+ def analyze_results(self, results):
108
+ """
109
+ Analyze the results of APEX testing.
110
+ """
111
+ if not results:
112
+ print("No results to analyze")
113
+ return
114
+
115
+ print(f"\n=== Analysis of {len(results)} Generated Peptides ===")
116
+
117
+ # Extract scores
118
+ scores = [r['apex_score'] for r in results]
119
+ amp_count = sum(1 for r in results if r['is_amp'])
120
+
121
+ print(f"Total sequences tested: {len(results)}")
122
+ print(f"Predicted AMPs: {amp_count} ({amp_count/len(results)*100:.1f}%)")
123
+ print(f"Average MIC: {np.mean(scores):.2f} μg/mL")
124
+ print(f"MIC range: {np.min(scores):.2f} - {np.max(scores):.2f} μg/mL")
125
+ print(f"MIC std: {np.std(scores):.2f} μg/mL")
126
+
127
+ # Show top candidates
128
+ top_candidates = sorted(results, key=lambda x: x['apex_score'])[:10]  # lower MIC = better activity
129
+
130
+ print(f"\n=== Top 10 Candidates ===")
131
+ for i, candidate in enumerate(top_candidates):
132
+ print(f"{i+1:2d}. MIC: {candidate['apex_score']:.2f} μg/mL | "
133
+ f"Length: {candidate['length']:2d} | "
134
+ f"Sequence: {candidate['sequence']}")
135
+
136
+ return results
137
+
138
+ def save_results(self, results, filename='generated_peptides_results.json'):
139
+ """
140
+ Save results to JSON file.
141
+ """
142
+ if not results:
143
+ print("No results to save")
144
+ return
145
+
146
+ output = {
147
+ 'metadata': {
148
+ 'model_path': self.model_path,
149
+ 'num_sequences': len(results),
150
+ 'generation_timestamp': str(torch.cuda.Event() if torch.cuda.is_available() else 'cpu'),
151
+ 'apex_available': APEX_AVAILABLE
152
+ },
153
+ 'results': results
154
+ }
155
+
156
+ with open(filename, 'w') as f:
157
+ json.dump(output, f, indent=2)
158
+
159
+ print(f"✓ Results saved to {filename}")
160
+
161
+ def run_full_pipeline(self, num_samples=100, save_results=True):
162
+ """
163
+ Run the complete pipeline: generate peptides and test with APEX.
164
+ """
165
+ print("🚀 Starting Full Peptide Generation and Testing Pipeline")
166
+ print("=" * 60)
167
+
168
+ # Step 1: Generate peptides
169
+ sequences = self.generate_peptides(num_samples=num_samples)
170
+
171
+ # Step 2: Test with APEX
172
+ results = self.test_with_apex(sequences)
173
+
174
+ # Step 3: Analyze results
175
+ if results:
176
+ self.analyze_results(results)
177
+
178
+ # Step 4: Save results
179
+ if save_results:
180
+ self.save_results(results)
181
+
182
+ return results
183
+
184
+ def main():
185
+ """
186
+ Main function to test existing decoded sequence files with APEX.
187
+ """
188
+ print("🧬 AMP Flow Model - Testing Decoded Sequences with APEX")
189
+ print("=" * 60)
190
+
191
+ # Check if APEX is available
192
+ if not APEX_AVAILABLE:
193
+ print("❌ Local APEX not available - cannot test sequences")
194
+ print("Please ensure local_apex_wrapper.py is properly set up")
195
+ return
196
+
197
+ # Initialize tester (we only need APEX, not the generator)
198
+ print("Initializing APEX predictor...")
199
+ apex = LocalAPEXWrapper()
200
+ print("✓ Local APEX loaded successfully!")
201
+
202
+ # Get today's date for filename
203
+ today = datetime.now().strftime('%Y%m%d')
204
+
205
+ # Define the decoded sequence files to test (using today's generated sequences)
206
+ cfg_files = {
207
+ 'No CFG (0.0)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_no_cfg_00_{today}.txt',
208
+ 'Weak CFG (3.0)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_weak_cfg_30_{today}.txt',
209
+ 'Strong CFG (7.5)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_strong_cfg_75_{today}.txt',
210
+ 'Very Strong CFG (15.0)': f'/data2/edwardsun/decoded_sequences/decoded_sequences_very_strong_cfg_150_{today}.txt'
211
+ }
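+ # The CFG guidance scale is encoded in each filename without the decimal point
+ # (e.g. 3.0 -> "30", 7.5 -> "75", 15.0 -> "150").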
212
+
213
+ all_results = {}
214
+
215
+ for cfg_name, file_path in cfg_files.items():
216
+ print(f"\n{'='*60}")
217
+ print(f"Testing {cfg_name} sequences...")
218
+ print(f"Loading: {file_path}")
219
+
220
+ if not os.path.exists(file_path):
221
+ print(f"❌ File not found: {file_path}")
222
+ continue
223
+
224
+ # Read sequences from file
225
+ sequences = []
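+ # Decoded-sequence files are parsed as tab-separated lines; the peptide
+ # sequence is read from the second field, and lines starting with '#' are skipped.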
226
+ with open(file_path, 'r') as f:
227
+ for line in f:
228
+ line = line.strip()
229
+ if line and not line.startswith('#') and '\t' in line:
230
+ # Parse sequence from tab-separated format
231
+ parts = line.split('\t')
232
+ if len(parts) >= 2:
233
+ seq = parts[1].strip()
234
+ if seq:
235
+ sequences.append(seq)
236
+
237
+ print(f"✓ Loaded {len(sequences)} sequences from {file_path}")
238
+
239
+ # Test sequences with APEX
240
+ results = []
241
+ print(f"Testing {len(sequences)} sequences with APEX...")
242
+
243
+ for i, seq in enumerate(tqdm(sequences, desc=f"Testing {cfg_name}")):
244
+ try:
245
+ # Predict antimicrobial activity using local APEX
246
+ avg_mic = apex.predict_single(seq)
247
+ is_amp = apex.is_amp(seq, threshold=32.0)  # classify as AMP at a 32 μg/mL MIC threshold
248
+
249
+ result = {
250
+ 'sequence': seq,
251
+ 'sequence_id': f'{cfg_name.lower().replace(" ", "_").replace("(", "").replace(")", "").replace(".", "")}_{i:03d}',
252
+ 'cfg_setting': cfg_name,
253
+ 'apex_score': avg_mic, # Lower MIC = better activity
254
+ 'is_amp': is_amp,
255
+ 'length': len(seq)
256
+ }
257
+ results.append(result)
258
+
259
+ except Exception as e:
260
+ print(f"Warning: Error testing sequence {i}: {e}")
261
+ continue
262
+
263
+ # Analyze results for this CFG setting
264
+ if results:
265
+ print(f"\n=== Analysis of {cfg_name} ===")
266
+ scores = [r['apex_score'] for r in results]
267
+ amp_count = sum(1 for r in results if r['is_amp'])
268
+
269
+ print(f"Total sequences tested: {len(results)}")
270
+ print(f"Predicted AMPs: {amp_count} ({amp_count/len(results)*100:.1f}%)")
271
+ print(f"Average MIC: {np.mean(scores):.2f} μg/mL")
272
+ print(f"MIC range: {np.min(scores):.2f} - {np.max(scores):.2f} μg/mL")
273
+ print(f"MIC std: {np.std(scores):.2f} μg/mL")
274
+
275
+ # Show top 5 candidates for this CFG setting
276
+ top_candidates = sorted(results, key=lambda x: x['apex_score'])[:5] # Lower MIC is better
277
+
278
+ print(f"\n=== Top 5 Candidates ({cfg_name}) ===")
279
+ for i, candidate in enumerate(top_candidates):
280
+ print(f"{i+1:2d}. MIC: {candidate['apex_score']:.2f} μg/mL | "
281
+ f"Length: {candidate['length']:2d} | "
282
+ f"Sequence: {candidate['sequence']}")
283
+
284
+ all_results[cfg_name] = results
285
+
286
+ # Create output directory if it doesn't exist
287
+ output_dir = '/data2/edwardsun/apex_results'
288
+ os.makedirs(output_dir, exist_ok=True)
289
+
290
+ # Save individual results with date
291
+ output_file = os.path.join(output_dir, f"apex_results_{cfg_name.lower().replace(' ', '_').replace('(', '').replace(')', '').replace('.', '')}_{today}.json")
292
+ with open(output_file, 'w') as f:
293
+ json.dump({
294
+ 'metadata': {
295
+ 'cfg_setting': cfg_name,
296
+ 'num_sequences': len(results),
297
+ 'apex_available': APEX_AVAILABLE
298
+ },
299
+ 'results': results
300
+ }, f, indent=2)
301
+ print(f"✓ Results saved to {output_file}")
302
+
303
+ # Overall comparison
304
+ print(f"\n{'='*60}")
305
+ print("OVERALL COMPARISON ACROSS CFG SETTINGS")
306
+ print(f"{'='*60}")
307
+
308
+ for cfg_name, results in all_results.items():
309
+ if results:
310
+ scores = [r['apex_score'] for r in results]
311
+ amp_count = sum(1 for r in results if r['is_amp'])
312
+ print(f"\n{cfg_name}:")
313
+ print(f" Total: {len(results)} | AMPs: {amp_count} ({amp_count/len(results)*100:.1f}%)")
314
+ print(f" Avg MIC: {np.mean(scores):.2f} μg/mL | Best MIC: {np.min(scores):.2f} μg/mL")
315
+
316
+ # Find best overall candidates
317
+ all_candidates = []
318
+ for cfg_name, results in all_results.items():
319
+ all_candidates.extend(results)
320
+
321
+ if all_candidates:
322
+ print(f"\n{'='*60}")
323
+ print("TOP 10 OVERALL CANDIDATES (All CFG Settings)")
324
+ print(f"{'='*60}")
325
+
326
+ top_overall = sorted(all_candidates, key=lambda x: x['apex_score'])[:10]
327
+ for i, candidate in enumerate(top_overall):
328
+ print(f"{i+1:2d}. MIC: {candidate['apex_score']:.2f} μg/mL | "
329
+ f"CFG: {candidate['cfg_setting']} | "
330
+ f"Sequence: {candidate['sequence']}")
331
+
332
+ # Create output directory if it doesn't exist
333
+ output_dir = '/data2/edwardsun/apex_results'
334
+ os.makedirs(output_dir, exist_ok=True)
335
+
336
+ # Save overall results with date
337
+ overall_results_file = os.path.join(output_dir, f'apex_results_all_cfg_comparison_{today}.json')
338
+ with open(overall_results_file, 'w') as f:
339
+ json.dump({
340
+ 'metadata': {
341
+ 'date': today,
342
+ 'total_sequences': len(all_candidates),
343
+ 'apex_available': APEX_AVAILABLE,
344
+ 'cfg_settings_tested': list(all_results.keys())
345
+ },
346
+ 'results': all_candidates
347
+ }, f, indent=2)
348
+ print(f"\n✓ Overall results saved to {overall_results_file}")
349
+
350
+ # Save comprehensive MIC summary
351
+ mic_summary_file = os.path.join(output_dir, f'mic_summary_{today}.json')
352
+ mic_summary = {
353
+ 'date': today,
354
+ 'summary_by_cfg': {},
355
+ 'all_mics': [r['apex_score'] for r in all_candidates],
356
+ 'amp_count': sum(1 for r in all_candidates if r['is_amp']),
357
+ 'total_sequences': len(all_candidates)
358
+ }
359
+
360
+ for cfg_name, results in all_results.items():
361
+ if results:
362
+ scores = [r['apex_score'] for r in results]
363
+ amp_count = sum(1 for r in results if r['is_amp'])
364
+ mic_summary['summary_by_cfg'][cfg_name] = {
365
+ 'num_sequences': len(results),
366
+ 'amp_count': amp_count,
367
+ 'amp_percentage': amp_count/len(results)*100,
368
+ 'avg_mic': np.mean(scores),
369
+ 'min_mic': np.min(scores),
370
+ 'max_mic': np.max(scores),
371
+ 'std_mic': np.std(scores),
372
+ 'all_mics': scores
373
+ }
374
+
375
+ with open(mic_summary_file, 'w') as f:
376
+ json.dump(mic_summary, f, indent=2)
377
+ print(f"✓ MIC summary saved to {mic_summary_file}")
378
+
379
+ print(f"\n✅ APEX testing completed successfully!")
380
+ print(f"Tested {len(all_candidates)} total sequences across all CFG settings")
381
+
382
+ if __name__ == "__main__":
383
+ main()
usage_example.py ADDED
@@ -0,0 +1,60 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ FlowAMP Usage Example
4
+ This script demonstrates how to use the FlowAMP model for AMP generation.
5
+ Note: This is a demonstration version. For full functionality, you'll need to train the model.
6
+ """
7
+
8
+ import torch
9
+ from final_flow_model import AMPFlowMatcherCFGConcat
10
+
11
+ def main():
12
+ print("=== FlowAMP Usage Example ===")
13
+ print("This demonstrates the model architecture and usage.")
14
+
15
+ if torch.cuda.is_available():
16
+ device = torch.device("cuda")
17
+ print("Using CUDA")
18
+ else:
19
+ device = torch.device("cpu")
20
+ print("Using CPU")
21
+
22
+ # Initialize model
23
+ model = AMPFlowMatcherCFGConcat(
24
+ hidden_dim=480,
25
+ compressed_dim=80,
26
+ n_layers=4,
27
+ n_heads=8,
28
+ dim_ff=1920,
29
+ dropout=0.1,
30
+ max_seq_len=25,
31
+ use_cfg=True
32
+ ).to(device)
33
+
34
+ print("Model initialized successfully!")
35
+ print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
36
+
37
+ # Demonstrate model forward pass
38
+ batch_size = 2
39
+ seq_len = 25
40
+ compressed_dim = 80
41
+
42
+ # Create dummy input
43
+ x = torch.randn(batch_size, seq_len, compressed_dim).to(device)
44
+ time_steps = torch.rand(batch_size, 1).to(device)
45
+
46
+ # Forward pass
47
+ with torch.no_grad():
48
+ output = model(x, time_steps)
49
+
50
+ print(f"Input shape: {x.shape}")
51
+ print(f"Output shape: {output.shape}")
52
+ print("✓ Model forward pass successful!")
53
+
54
+ print("\nTo use this model for AMP generation:")
55
+ print("1. Train the model using the provided training scripts")
56
+ print("2. Use generate_amps.py for peptide generation")
57
+ print("3. Use test_generated_peptides.py for evaluation")
58
+
59
+ if __name__ == "__main__":
60
+ main()