New unified video IO #61

Open
Kamino666 opened this issue Aug 24, 2022 · 13 comments

@Kamino666
Contributor

Your recent changes are great and have significantly reduced code redundancy. The code is also easier to read now. Here I have another idea that might make it a little better.

The process of extracting video frames is now defined in base_flow_extractor.py and base_framewise_extractor.py, which use ffmpeg to re-encode the video and OpenCV to read it. But users may need a new feature: extracting a fixed number of frames from a video, which is hard to implement with the current framework. Users may also want faster feature extraction, so could you consider a more efficient API for reading videos (e.g., decord)? Moreover, specifying an fps for extraction currently requires re-encoding the video and generating temporary files. Is this necessary? Could we hit the target fps by extracting frames at intervals rather than re-encoding?

The ideal API might look like this:

# when fps = 2
frame_extractor = FrameExtractor("sample.mp4", batchsize=16, fps=2)
# when we want to extract 12 frames
frame_extractor = FrameExtractor("sample.mp4", batchsize=16, total=12)
# when extracting flow (adjacent batches overlap by one frame)
frame_extractor = FrameExtractor("sample.mp4", batchsize=16, overlap=1)

for batch, time_stamp in frame_extractor:
    batch = transform(batch)  # or as a parameter for the frame extractor
    _run_on_a_batch(batch)
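
For illustration, a minimal sketch of how such an extractor could work on top of cv2, sampling frame indices at intervals instead of re-encoding (the class name and arguments follow the hypothetical API above; the flow overlap case is omitted; this is not code from the repo):

import cv2
import numpy as np

class FrameExtractor:
    def __init__(self, path, batchsize=16, fps=None, total=None):
        self.cap = cv2.VideoCapture(path)
        src_fps = self.cap.get(cv2.CAP_PROP_FPS)
        n_frames = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))
        if total is not None:
            n_out = total                          # fixed number of frames
        elif fps is not None:
            n_out = int(n_frames / src_fps * fps)  # target fps via intervals
        else:
            n_out = n_frames                       # every frame
        # source-frame indices to keep and their timestamps in seconds
        self.indices = np.linspace(0, n_frames - 1, n_out).round().astype(int)
        self.timestamps = self.indices / src_fps
        self.batchsize = batchsize

    def __iter__(self):
        batch, stamps, next_i, pos = [], [], 0, 0
        ok, frame = self.cap.read()
        while ok and next_i < len(self.indices):
            # a source frame may be used more than once when upsampling
            while next_i < len(self.indices) and self.indices[next_i] == pos:
                batch.append(frame)
                stamps.append(self.timestamps[next_i])
                next_i += 1
                if len(batch) == self.batchsize:
                    yield np.stack(batch), np.array(stamps)
                    batch, stamps = [], []
            ok, frame = self.cap.read()
            pos += 1
        if batch:  # flush the last, possibly smaller, batch
            yield np.stack(batch), np.array(stamps)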

In summary, this new feature could further reduce code redundancy, improve speed, and provide new extraction methods.

I would love to make this improvement, but I've been busy lately and probably won't be able to do it until September.

@v-iashin
Owner

Hi there!

Yees! I have been thinking about IO a lot and have wanted to change it since the very inception of this project. Also, note that some of the features rely on cv2 while others rely on torchvision.io.read_video.

We need to unify it, and, hopefully, make it faster by picking another IO tool.

Also, we need to merge the frame-wise base extractor and the flow base extractor once the IO algorithm is unified.

specifying fps to extract features requires re-encoding the video and generating temporary files. Is this necessary?

I thought so, and it is quite plain and easy to understand. I do admit that we could do a better job there. However, first, I need to figure out how FFmpeg converts 30 fps to 25 fps, for instance, or handles some other non-trivial resampling, and, second, we would need to reimplement that skipping/duplicating algorithm here. It seems like overkill at this stage, but I think you have a clearer view of how to do it. Could you share it? I would love to read it.


On a side note,

I dislike having cv2 as a dependency and the fact that it is quite slow, but I like that I can read a video frame by frame and extract features from a batch. Note that if you try to load the whole video with a reader and then process it, we will quickly run out of RAM.

Recently, I’ve been trying many video IO scripts:

  • ffmpeg-python, reading frames from the mp4 directly by converting the binary stream into numpy, as implemented in Antoine Miech's S3D. Main con: needs to load the whole video into RAM.
  • Extracting frames with the ffmpeg CLI and loading them with an image IO library. Main con: slow, but also gross.
  • torchvision.io.read_video is good enough, +/- the fact that it also reads audio (convenient but sometimes unnecessary - it makes extraction slower). Main con: it reads the whole video and does not work out of the box after installing torchvision (it needs PyAV as a dependency).
  • The brand new torchvision.io.VideoReader (see the sketch after this list). It is incredibly fast and I love it. It lets you iterate over either the audio or the RGB track of an mp4 frame by frame. However, while we can skip frames in the RGB track, audio is read in chunks of 1024 or 2048 samples, which I don't know how to resample on the fly (e.g., 44100 Hz into 48000 Hz) - the same problem as 30 fps into 25 fps. Another con is dependency hell: I installed it once and I don't even know how to do it again, and it depends on the latest torch(vision). One can find issues in the torch/vision repo regarding this new API. Another slightly strange local issue is related to reconstructing the audio read this way (mp4 -> VideoReader('audio') -> wav_tensor -> torchvision.write_video -> .wav file): it produced higher-pitched audio in my environment yet worked on Google Colab. In other words, it is raw, but I like it.
  • I think I tried Decord but never managed to cover my needs with it. I don't remember exactly what my issue was, but I gave it a try.
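
For reference, a minimal sketch of that torchvision.io.VideoReader usage (assuming a torchvision build with video support; illustrative, not code from this repo):

from torchvision.io import VideoReader

# iterate the RGB track frame by frame: each item is a dict with a
# 'data' tensor of shape (C, H, W) and a 'pts' timestamp in seconds
reader = VideoReader("sample.mp4", "video")
for frame in reader:
    rgb, t = frame["data"], frame["pts"]

# the audio track is iterated the same way, but it arrives in fixed-size
# chunks of samples rather than one chunk per video frame
reader = VideoReader("sample.mp4", "audio")
for chunk in reader:
    samples = chunk["data"]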

v-iashin changed the title from "Suggest a new feature of frame extraction." to "New unified video IO" on Aug 24, 2022
@v-iashin
Owner

I will rename the issue to make it more precise if I understand your suggestion correctly.

@Kamino666
Contributor Author

Thank you for making the title clearer!

Regarding changing the FPS, I need to find more information about it. My initial idea was to first calculate the number of frames needed for the new fps, then use np.linspace(0, frame_num - 1, new_frame_num) (rounded to integers) to get the indices of the frames to extract, and finally extract the needed frames based on those indices. This has an error in some cases, i.e., the frames obtained may not correspond exactly to the timestamps. The re-encoding method is more accurate, while the method above is faster. I'm not sure whether this error can be ignored or how everyone else handles it.
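
A minimal sketch of this index-selection idea, including the timestamp error it introduces (illustrative numbers, not the exact code from the comment above):

import numpy as np

src_fps, frame_num = 30.0, 90                       # a 3-second clip
new_fps = 25.0
new_frame_num = int(frame_num / src_fps * new_fps)  # 75 frames at 25 fps

# indices of the source frames to keep
indices = np.linspace(0, frame_num - 1, new_frame_num).round().astype(int)

# the timestamps we want at 25 fps vs the timestamps we actually get
wanted = np.arange(new_frame_num) / new_fps
actual = indices / src_fps
print(np.abs(wanted - actual).max())  # worst case is on the order of one
                                      # source-frame duration (1/30 s here)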

@Kamino666
Contributor Author

torchvision.io.VideoReader looks good! But it is really sad that we have to use such a recent version of torch.

I think it would be very good if Decord meets our needs.

We could try to find the best IO script as a start.

@Kamino666
Contributor Author

By the way, ChangingFrameRate may be helpful.

@Kamino666
Contributor Author

I did a benchmark on Decord, mmcv, and OpenCV.

It seems that OpenCV is faster when reading short videos sequentially, and decord is faster when reading long videos sequentially (a minimal version of such a sequential-read benchmark is sketched after the list below). None of the three modules loads the full video into memory. Two remaining points:

  1. Decord might be faster when using the GPU. I will try this.
  2. I never managed to install torchvision.io.VideoReader 😞
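
For illustration, a minimal sequential-read benchmark for cv2 vs decord could look like this (both libraries assumed installed; not the exact script used here):

import time

import cv2
from decord import VideoReader, cpu

def time_cv2(path):
    cap = cv2.VideoCapture(path)
    start = time.perf_counter()
    while cap.read()[0]:  # read frames until the stream is exhausted
        pass
    cap.release()
    return time.perf_counter() - start

def time_decord(path):
    vr = VideoReader(path, ctx=cpu(0))
    start = time.perf_counter()
    for i in range(len(vr)):
        frame = vr[i]  # decoded frame as an NDArray
    return time.perf_counter() - start

for path in ["short.mp4", "long.mp4"]:
    print(path, f"cv2: {time_cv2(path):.2f}s", f"decord: {time_decord(path):.2f}s")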

@v-iashin
Owner

Oh, thanks a lot!

A few things I would like to clarify for myself:

  1. When we rely on video readers such as those you benchmarked, do we need to check the memory footprint of the init call (such as cap = cv2.VideoCapture("long.mp4"))? These don't load the video into memory, do they? They are generators, not iterators - am I right? (One way to check is sketched after this list.)
  2. I see quite a small difference in speed (at most ~20% for random access). If that is the case, we could pick the one with the least dependency pain. Which one do you think is better from this perspective?
  3. Yes, I think before the last torch update you could run VideoReader on Colab. I think it is no longer possible even there. This is very disappointing, but I think the torch team knows about it.
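
One way to check the init-call footprint, sketched with memory_profiler (assumed installed; illustrative, not code from the repo):

from memory_profiler import profile
import cv2

@profile
def init_only(path):
    # if VideoCapture only parses the container header, memory should barely
    # move here; decoding the whole file would show a video-sized increment
    cap = cv2.VideoCapture(path)
    n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    return n_frames

if __name__ == "__main__":
    init_only("long.mp4")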

@Kamino666
Contributor Author

For the first point, yes, I think we need to check whether the reader loads the whole video into RAM, and they do not. I don't really understand the difference between the generator and iterator you mentioned, but as far as file handling goes, these readers probably only read some basic information about the video when they are initialized, after which they move a read pointer according to the user's input (I'm not sure of the details either).

For the second and third points, I think we'd better keep OpenCV until torchvision.io.VideoReader is more easily accessible.

I also tested the GPU version of decord afterwards; while it is relatively easy to install (compared to torchvision) and gives a very large speedup when reading long videos sequentially, it seems to read the entire video into memory.

We may just need to change the existing IO API, but I still have a question: should we re-encode the video and use all frames after re-encoding, or should we just read the video and extract the frames we need? Perhaps you have more experience in this area? Do you know how other researchers have handled this?

@v-iashin
Owner

v-iashin commented Aug 27, 2022

I don't really understand the difference between the generator and iterator you mentioned.

Iterators, in the sense I meant, keep all their values in RAM and you just iterate through them in some way, e.g., the list [1, 2, 3]. A generator produces each new value on demand, which means it does not need to store the whole list/video in RAM. I think the famous example of this distinction is the range() function in Python 2 (returns a full list) versus Python 3 (returns a lazy, generator-like object).
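
A small illustration of the memory difference:

import sys

as_list = list(range(10**6))        # all values materialized in RAM
as_gen = (i for i in range(10**6))  # values produced one at a time

print(sys.getsizeof(as_list))  # ~8 MB for the list object alone
print(sys.getsizeof(as_gen))   # ~100 bytes, regardless of the length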

should we re-encode the video and use all frames after re-encoding?
Yes, unless we figure out a clean and well-known way of dealing with non-trivial source/target fps. For instance, trivial: from 30 to 10 (going in triplets ABC and dropping A and C); non-trivial: from 24 to 15. Here, I would rather rely on an existing implementation like ffmpeg than write something with many edge cases myself, as bugs can occur.

Usually, researchers just pre-encode the whole folder to the same fps. For instance, scrape YouTube, then re-encode everything with ffmpeg to 25 fps. I think reencode_with_another_fps does the same thing, except it runs every time a person processes a video. That is negligible if the user extracts features once, but can be problematic if they need 10 different features, as it will run 10 times. Well, I think it is still ok: the process is very fast and not the bottleneck, so I don't share your concerns, really. Yet I admit that it is ugly and I would want something else. Maybe we could replace it with something more pythonic like ffmpeg-python (a sketch is below). It does the same thing, but you don't need to write the command in an ffmpeg CLI kind of way. Take a look at ffmpeg-python - do you think it would be nice to replace the current script with it? (I could do it, don't worry; the docs for this library are quite horrible, by the way.)
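
A sketch of what the re-encoding step could look like with ffmpeg-python (the function name is illustrative, not the repo's actual helper):

import ffmpeg

def reencode_to_fps(src_path: str, dst_path: str, fps: float) -> None:
    # roughly equivalent to: ffmpeg -i src_path -filter:v fps=25 dst_path
    (
        ffmpeg
        .input(src_path)
        .filter("fps", fps=fps)  # ffmpeg decides which frames to skip/duplicate
        .output(dst_path)
        .overwrite_output()
        .run(quiet=True)
    )

reencode_to_fps("sample.mp4", "sample_25fps.mp4", fps=25)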

@Kamino666
Contributor Author

Thanks for your reply, and I agree that using ffmpeg solves the fps problem. I wasn't worried about the brevity of the code; in fact, using command-line calls to ffmpeg works quite well.

@v-iashin
Owner

it seems to read the entire video into memory.

Ok, I decided to double-check this because that wasn't my impression the last time I checked.

Thanks for providing the Colab MWE; I could quickly start experimenting.

Here is my 2nd honest attempt to use decord (it has some comments):
https://colab.research.google.com/drive/1sF4BjvFMkU6uBtupKgZ8_2ROCZBuEd-h?usp=sharing

By the way, did you compile it for GPU use or install it in some other way?

@Kamino666
Contributor Author

Sorry for the late reply, but I'm very busy these days as I'm at the beginning of the semester.

Your analysis of decord is much better than mine, and I also feel that the CPU version of decord is not much better than cv2 for now. I compiled decord from source for GPU use. You can use the following commands to install the GPU version of decord on Colab:

# system build tools and FFmpeg development headers
!sudo apt-get update
!sudo apt-get install -y build-essential python3-dev python3-setuptools make cmake
!sudo apt-get install -y libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev
# build decord from source with CUDA support enabled
!git clone --recursive https://github.com/dmlc/decord
%cd decord
!mkdir build
%cd build
!cmake .. -DUSE_CUDA=ON -DCMAKE_BUILD_TYPE=Release
!make
# install the Python bindings
%cd /content/decord/python
!python setup.py install --user
%cd /content

I tried the GPU version of decord with your Colab notebook. The results are as follows:

Filename: tmp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5   75.531 MiB   75.531 MiB           1   @profile
     6                                         def my_func():
     7  204.941 MiB  129.410 MiB           1       vr = VideoReader('long.mp4', ctx=gpu(0))
     8  270.617 MiB   65.676 MiB           1       batch = vr.next()
     9  270.617 MiB    0.000 MiB           1       batch = vr.next()
    10  270.617 MiB    0.000 MiB           1       batch = vr.next()
    11  270.617 MiB    0.000 MiB           1       batch = vr.next()
    12  270.617 MiB    0.000 MiB           1       batch = vr.next()
    13  270.617 MiB    0.000 MiB           1       batch = vr.next()
    14  270.617 MiB    0.000 MiB           1       batch = vr.next()
    15  270.617 MiB    0.000 MiB           1       batch = vr.next()
    16  270.617 MiB    0.000 MiB           1       batch = vr.next()
    17                                             
    18  276.551 MiB    0.000 MiB         669       for i in range(1024):
    19  276.551 MiB    5.934 MiB         669           batch = vr.next()
CPU times: user 40.3 ms, sys: 11.1 ms, total: 51.5 ms
Wall time: 5.13 s

I also tried the CPU version with your code. The results differ slightly from yours in the notebook:

Filename: tmp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5   76.547 MiB   76.547 MiB           1   @profile
     6                                         def my_func():
     7  107.945 MiB   31.398 MiB           1       vr = VideoReader('long.mp4')
     8  369.211 MiB  261.266 MiB           1       batch = vr.next()
     9  421.965 MiB   52.754 MiB           1       batch = vr.next()
    10  421.965 MiB    0.000 MiB           1       batch = vr.next()
    11  425.828 MiB    3.863 MiB           1       batch = vr.next()
    12  425.828 MiB    0.000 MiB           1       batch = vr.next()
    13  437.938 MiB   12.109 MiB           1       batch = vr.next()
    14  437.938 MiB    0.000 MiB           1       batch = vr.next()
    15  438.707 MiB    0.770 MiB           1       batch = vr.next()
    16  438.707 MiB    0.000 MiB           1       batch = vr.next()
    17                                             
    18  455.441 MiB    0.000 MiB         669       for i in range(1024):
    19  455.441 MiB   16.734 MiB         669           batch = vr.next()
CPU times: user 220 ms, sys: 40.2 ms, total: 261 ms
Wall time: 37.2 s

The GPU version uses less memory and is significantly faster (over 7 times).
Here is the changed code: https://colab.research.google.com/drive/1rs80lFKHL9_kASY4DqwqnmxfxUq26zCg?usp=sharing

@v-iashin
Owner

#70
