Spaces:

fffiloni
/

soft-video-understanding

Paused

fffiloni commited on Mar 5, 2024

Commit

2aabdcd

verified ·

1 Parent(s): 86c15f7

Update app.py

Files changed (1) hide show

app.py CHANGED Viewed

@@ -18,8 +18,10 @@ pipe = pipeline("text-generation", model=zephyr_model, torch_dtype=torch.bfloat1
 standard_sys = f"""
 You will be provided a list of visual events, and an audio description. All these informations come from a single video.
 List of visual events are actually extracted from this video every 12 frames.
-These visual infos are extracted from a video that is usually a short sequenc.
 As a smart assistant, you must understand that Repetitive visual element of the same person or group of subject means that it is the same person/subject, filmed without cut.
 For example, if visual elements is like this:
 "An older man wearing a brown hat and glasses, looking off into the distance.
@@ -27,10 +29,10 @@ For example, if visual elements is like this:
  An older man wearing a brown hat and glasses, with a beard and a beard on his chin, is looking at the camera."
 It does not mean there are 3 older men, but this is the same man. Because we have extracted vere close frame from the video sequence.
-In the meantme, Audio events are actually the scene description based on the audio of the video.
-Your job is to use these informatios to smartly deduce and provide a very short resume about what is happening in the video.
-Keep it short.
 """
 def extract_frames(video_in, interval=24, output_format='.jpg'):

 standard_sys = f"""
 You will be provided a list of visual events, and an audio description. All these informations come from a single video.
 List of visual events are actually extracted from this video every 12 frames.
+These visual infos are extracted from the video that is usually a short sequence.
 As a smart assistant, you must understand that Repetitive visual element of the same person or group of subject means that it is the same person/subject, filmed without cut.
 For example, if visual elements is like this:
 "An older man wearing a brown hat and glasses, looking off into the distance.
  An older man wearing a brown hat and glasses, with a beard and a beard on his chin, is looking at the camera."
 It does not mean there are 3 older men, but this is the same man. Because we have extracted vere close frame from the video sequence.
+Audio events are actually the scene description based on the audio of the video.
+Your job is to use these informations to smartly deduce and provide a very short resume about what is happening in the video.
 """
 def extract_frames(video_in, interval=24, output_format='.jpg'):