This is an active project currently making use of HPC facilities at Newcastle University.
'They've Got Your Eyes' is a 15-minute film commission using AI models to 'rotoscope' onto live-action footage.
I have been researching AI and machine learning technologies for the last few years as part of my NUAcT Fellowship. My desire with this work is to test the limits of the technology creatively, while also making a film that poses questions about the technology itself.
My ambition to work primarily with local, server-based models is driven by a desire to mitigate some of the high energy usage associated with cloud-based generative AI programmes.
Technical Production Process and Requirements
I'm currently using the following open-source models in conjunction with ComfyUI to output video; ComfyUI runs fine on Linux, so Linux nodes are suitable. A minimal sketch of queueing a workflow on a headless node follows the model list.
Wan2.2 VACE T2V 14b
Wan Lightx2v
Depth Anything V2
Native UMT5 scaled (a CLIP-style text-encoder model)
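
Because the HPC nodes are headless, one way to drive ComfyUI is through its built-in HTTP API rather than the browser UI. Below is a minimal sketch of queueing a workflow from Python, assuming a ComfyUI server is already running on its default port 8188 and the workflow has been exported with ComfyUI's "Save (API Format)" option; the filename is a placeholder, not a real project file.

    # Minimal sketch: queue a ComfyUI workflow over the server's HTTP API.
    # Assumes a ComfyUI server on the default port 8188; the workflow
    # filename below is a placeholder.
    import json
    import urllib.request

    COMFYUI_URL = "http://127.0.0.1:8188/prompt"

    def queue_workflow(path):
        # Load a workflow exported via ComfyUI's "Save (API Format)" option.
        with open(path) as f:
            workflow = json.load(f)
        payload = json.dumps({"prompt": workflow}).encode("utf-8")
        req = urllib.request.Request(
            COMFYUI_URL, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)  # response includes the queued prompt_id

    queue_workflow("wan22_vace_t2v.json")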
The process involves inputting a video shot on green screen, with roughly composited backgrounds. I have trained models in Flux and SDXL to generate still frames to feed into the Wan2.2 model.
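
Depth Anything V2 supplies the per-frame depth maps that guide this rotoscoping step. Inside ComfyUI this is handled by a node, but the same operation can be sketched standalone via the Hugging Face transformers release of the model (the model size and frame filename here are illustrative assumptions, not the project's exact setup):

    # Standalone sketch of the depth-map step, assuming the Hugging Face
    # release of Depth Anything V2. Model size and filenames are assumptions.
    from transformers import pipeline
    from PIL import Image

    depth_estimator = pipeline(
        "depth-estimation",
        model="depth-anything/Depth-Anything-V2-Small-hf",
    )

    frame = Image.open("greenscreen_frame_0001.png")  # one extracted frame
    result = depth_estimator(frame)
    result["depth"].save("depth_0001.png")            # greyscale depth map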
To achieve this, I need access to high-powered GPUs. Initial tests of the workflow suggest that the following specs, or higher, would be suitable for this post-production process (a quick check script is sketched after the list):
NVIDIA A10G or RTX A5000
24 GB VRAM | 32 GB RAM | 8 vCPUs
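
A quick sanity check can verify that an allocated node meets these targets. The thresholds are the figures quoted above; the script itself is an illustrative sketch, not project code.

    # Sanity-check an allocated node against the spec targets above.
    import os
    import torch

    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3

    assert vram_gb >= 24, f"GPU has only {vram_gb:.0f} GB VRAM (need 24 GB)"
    assert ram_gb >= 32, f"Node has only {ram_gb:.0f} GB RAM (need 32 GB)"
    print(f"OK: {vram_gb:.0f} GB VRAM, {ram_gb:.0f} GB RAM, "
          f"{os.cpu_count()} CPUs")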
Storage Requirements
An average clip is approximately 1-2 MB and about 30 seconds long. To conserve storage, we can disable the preview image and depth preview outputs. For a film totalling 15 minutes, we estimate needing between 500 GB and 1 TB of storage.
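
For context, a rough back-of-envelope calculation is consistent with that range. The per-frame size and run count below are illustrative assumptions, not measured project figures.

    # Back-of-envelope storage estimate. MB_PER_FRAME and RUNS_PER_CLIP
    # are illustrative assumptions, not measured project figures.
    FILM_MINUTES = 15
    FPS = 25                # final frame rate of the film
    RUNS_PER_CLIP = 3       # "2 or 3 runs" per clip (upper bound)
    MB_PER_FRAME = 6        # assumed size of one intermediate frame + depth map

    frames = FILM_MINUTES * 60 * FPS * RUNS_PER_CLIP  # 67,500 frames
    print(f"~{frames * MB_PER_FRAME / 1024:.0f} GB")  # ~396 GB

    # Roughly 400 GB before masks, tests and re-renders, which sits
    # comfortably inside the 500 GB - 1 TB estimate above.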
Node hours - processing time
One run of a 30-45 second clip currently takes about 5-10 minutes on a large GPU at 8 fps. With more VRAM and a more powerful GPU, this could be reduced to a few minutes. Running at the film's final frame rate of 25 fps could take 30 minutes per 30-second clip.
Each clip may need 2 or 3 runs.
Based on this, we anticipate using a GPU-L for up to 744 hours (the maximum, equivalent to 31 days).
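
A worked estimate from the figures above shows why 744 hours is a comfortable ceiling. The clip count is derived from the 15-minute running time rather than a confirmed shot list, and iteration overhead is excluded.

    # Worked node-hour estimate from the figures above. The clip count is
    # derived from the running time, not a confirmed shot list.
    CLIPS = 15 * 60 // 30   # ~30 clips of ~30 s in a 15-minute film
    MIN_PER_RUN = 30        # per clip at the final 25 fps
    RUNS_PER_CLIP = 3       # "2 or 3 runs" (upper bound)

    hours = CLIPS * MIN_PER_RUN * RUNS_PER_CLIP / 60
    print(hours)            # 45.0 GPU hours of pure render time

    # 744 hours (31 days x 24 h) therefore leaves ample headroom for
    # failed runs, tests, and re-renders.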
No training or fine-tuning of the models is required at this stage; the work is inference only, processing the generative video, image and text inputs and outputs.