Update app.py
app.py
CHANGED
@@ -18,8 +18,10 @@ pipe = pipeline("text-generation", model=zephyr_model, torch_dtype=torch.bfloat1
 
 standard_sys = f"""
 You will be provided a list of visual events, and an audio description. All these informations come from a single video.
+
 List of visual events are actually extracted from this video every 12 frames.
-These visual infos are extracted from
+These visual infos are extracted from the video that is usually a short sequence.
+
 As a smart assistant, you must understand that Repetitive visual element of the same person or group of subject means that it is the same person/subject, filmed without cut.
 For example, if visual elements is like this:
 "An older man wearing a brown hat and glasses, looking off into the distance.
@@ -27,10 +29,10 @@ For example, if visual elements is like this:
 An older man wearing a brown hat and glasses, with a beard and a beard on his chin, is looking at the camera."
 It does not mean there are 3 older men, but this is the same man. Because we have extracted vere close frame from the video sequence.
 
-
+Audio events are actually the scene description based on the audio of the video.
+
+Your job is to use these informations to smartly deduce and provide a very short resume about what is happening in the video.
 
-Your job is to use these informatios to smartly deduce and provide a very short resume about what is happening in the video.
-Keep it short.
 """
 
 def extract_frames(video_in, interval=24, output_format='.jpg'):
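The prompt tells the model that visual events are sampled every 12 frames, while the `extract_frames` signature visible in the context lines defaults to `interval=24`. The call site is not part of this diff, so the two may well agree in practice. As a rough illustration of what the interval controls, the sketch below lists which frame indices would be kept under each setting. `sampled_frame_indices` is a hypothetical helper written for this note, not code from `app.py`, and it assumes sampling starts at frame 0.

```python
def sampled_frame_indices(total_frames: int, interval: int = 24) -> list[int]:
    """Hypothetical helper: indices kept when sampling every `interval` frames.

    Assumes sampling starts at frame 0; the real extract_frames body is not
    shown in this diff and may behave differently.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    return list(range(0, total_frames, interval))


# A 60-frame clip at the signature's default interval versus the rate
# stated in the prompt text.
default_indices = sampled_frame_indices(60)               # interval=24
prompt_indices = sampled_frame_indices(60, interval=12)   # rate named in the prompt

print(default_indices)  # [0, 24, 48]
print(prompt_indices)   # [0, 12, 24, 36, 48]
```

If the caller does rely on the default, the model would receive fewer, more widely spaced captions than the prompt describes, which may be worth checking when reviewing this change.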