Add abstract to model card

#3
by nielsr (HF Staff) · opened
Files changed (1)
  1. README.md +76 -34
README.md CHANGED
@@ -1,17 +1,17 @@
  ---
- license: mit
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternVL2_5-26B
- base_model_relation: finetune
  datasets:
- - OpenGVLab/MMPR-v1.1
  language:
- - multilingual
  tags:
- - internvl
- - custom_code
  ---

  # InternVL2_5-26B-MPO
@@ -24,6 +24,10 @@ tags:
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
  </div>

  ## Introduction

  We introduce InternVL2.5-MPO, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. This series builds upon InternVL2.5 and Mixed Preference Optimization.
@@ -113,7 +117,7 @@ Additionally, the BCO loss is employed as the quality loss, which helps the mode
  The loss function is defined as:

  $$
- \mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,
  $$

  where \\(\mathcal{L}_{\text{q}}^{+}\\) and \\(\mathcal{L}_{\text{q}}^{-}\\) represent the losses for the chosen and rejected responses, respectively.
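
For context, the following is a sketch of how these two BCO-style terms are commonly instantiated in the MPO formulation; the symbols below (policy \\(\pi_\theta\\), reference model \\(\pi_0\\), KL coefficient \\(\beta\\), reward shift \\(\delta\\), and chosen/rejected responses \\(y_c\\)/\\(y_r\\)) are not defined in this hunk and are assumed from the surrounding model card:

$$
\mathcal{L}_{\text{q}}^{+}=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)}-\delta\right), \qquad
\mathcal{L}_{\text{q}}^{-}=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)}-\delta\right)\right),
$$

where \\(\delta\\) is a reward shift, typically estimated as a moving average of previous rewards.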
@@ -344,40 +348,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # single-image single-round conversation (单图单轮对话)
- question = '<image>\nPlease describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
- print(f'User: {question}\nAssistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
- question = '<image>\nPlease describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- question = '<image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -385,17 +399,20 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
-                                num_patches_list=num_patches_list,
-                                history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -403,13 +420,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
                               num_patches_list=num_patches_list,
                               questions=questions,
                               generation_config=generation_config)
  for question, response in zip(questions, responses):
-     print(f'User: {question}\nAssistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -447,17 +466,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
  ```

  #### Streaming Output
@@ -539,7 +565,9 @@ image_urls=[

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
  print(response.text)
  ```

@@ -648,8 +676,12 @@ If you find this project useful in your research, please consider citing:
  @article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
- journal={arXiv preprint arXiv:2404.16821},
- year={2024}
  }
  @inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
@@ -659,3 +691,13 @@ If you find this project useful in your research, please consider citing:
  year={2024}
  }
  ```

  ---
  base_model:
+ - OpenGVLab/InternVL2_5-26B
  datasets:
+ - OpenGVLab/MMPR-v1.1
  language:
+ - multilingual
+ library_name: transformers
+ license: mit
+ pipeline_tag: image-text-to-text
  tags:
+ - internvl
+ - custom_code
+ base_model_relation: finetune
  ---

  # InternVL2_5-26B-MPO
 
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
  </div>

+ ## Abstract
+
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
+
  ## Introduction

  We introduce InternVL2.5-MPO, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. This series builds upon InternVL2.5 and Mixed Preference Optimization.
 
  The loss function is defined as:

  $$
+ \mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,
  $$

  where \\(\mathcal{L}_{\text{q}}^{+}\\) and \\(\mathcal{L}_{\text{q}}^{-}\\) represent the losses for the chosen and rejected responses, respectively.
 
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # single-image single-round conversation (单图单轮对话)
+ question = '<image>\nPlease describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
+ print(f'User: {question}\nAssistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
+ question = '<image>\nPlease describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+ question = '<image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()

  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
+                                num_patches_list=num_patches_list, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()

  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
                               num_patches_list=num_patches_list,
                               questions=questions,
                               generation_config=generation_config)
  for question, response in zip(questions, responses):
+     print(f'User: {question}\nAssistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):

  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')
  ```

  #### Streaming Output
 

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
  print(response.text)
  ```
 
  @article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+ journal={Science China Information Sciences},
+ volume={67},
+ number={12},
+ pages={220101},
+ year={2024},
+ publisher={Springer}
  }
  @inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
 
  year={2024}
  }
  ```
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+ ______________________________________________________________________
+
+ Scan the following QR code to join our WeChat group.
+
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>