Using the GROVER model on hg38 data
According to the GROVER paper (https://www.nature.com/articles/s42256-024-00872-0), the model can be directly implemented with Python for any suitable fine-tuning task. The vocabulary (BPE-600) for the tokenized hg19 genome (600 cycles) is available as a data resource for fine-tuning models based on GROVER and can also be used to train different model architectures or for different purposes.
Also, the provided code (https://zenodo.org/records/13315363) states that they listed every CTCF motif in the genome that was detected by the FIMO software. For this specific task, they calculated the center of each motif and then added 500 nucleotides up- and downstream. This captures the sequence context of the motif, which is what GROVER was trained to work with.
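As far as I understand it, the windowing step can be sketched roughly as below. This is only my own minimal illustration, not the authors' code: the function names, the 0-based half-open coordinate convention, and the choice to skip motifs too close to a chromosome end are all my assumptions.

```python
from typing import Optional, Tuple

FLANK = 500  # nucleotides added on each side of the motif center


def motif_window(start: int, end: int, chrom_length: int,
                 flank: int = FLANK) -> Optional[Tuple[int, int]]:
    """Return a (window_start, window_end) interval centered on a motif.

    Assumption: start/end are 0-based, half-open (BED-style) coordinates.
    Returns None when the window would run past either chromosome end.
    """
    center = (start + end) // 2
    window_start = center - flank
    window_end = center + flank
    if window_start < 0 or window_end > chrom_length:
        return None
    return window_start, window_end


def extract_window(sequence: str, start: int, end: int,
                   flank: int = FLANK) -> Optional[str]:
    """Slice the flanked window out of a chromosome sequence string."""
    window = motif_window(start, end, len(sequence), flank)
    if window is None:
        return None
    return sequence[window[0]:window[1]].upper()
```

With the default flank this gives a 1000-nucleotide window per motif. The coordinates have to refer to the same assembly as the sequence being sliced, which is part of why I am asking about hg19 versus hg38.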
My questions are:
- Can I fine-tune the GROVER model when my dataset uses the hg38 reference genome? (I have both .fastq and .bam files and can use either.) I should add that, although the underlying "language" of DNA is the same, some genomic features and coordinates differ between hg19 and hg38.
- Should I use the hg19.fa reference genome to convert my .bam files to FASTA?