Text-to-video (T2V) generation can provide rich and diverse video content, but it suffers from typical issues such as content inconsistency between frames and text alignment failures, which degrade video smoothness. Moreover, methods that improve smoothness often smooth excessively, losing background texture and artistic expression. To address these problems, this paper proposes INR Smooth, a video smoothing strategy based on the relationship between interframe noise, which can improve the smoothness of most T2V generation tasks.
In this study, we demonstrate the significant potential of transferring the features extracted by the diffusion U-Net to the frequency domain, which opens a new perspective on diffusion models and suggests a new direction for optimizing them. The generation quality of Text-to-Image (T2I) or Text-to-Video (T2V) models can be significantly enhanced by modifying key indicators of the U-Net frequency-domain features. We first investigated the two types of modules used in the sampling process of the U-Net, CrossAttnUpBlock and UpBlock, and then examined the effect of fine-tuning these modules on the features that the U-Net extracts through its backbone and lateral skip connections.
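As a rough illustration of what modifying U-Net features in the frequency domain can look like, the sketch below rescales the low-frequency band of a feature map with NumPy. It is not the procedure from the study: the function name `scale_low_frequencies`, the square low-frequency mask, and the `radius` and `scale` parameters are all assumptions made for this example, and the input is a random array standing in for a real U-Net feature map.

```python
import numpy as np


def scale_low_frequencies(features, radius, scale):
    """Rescale the low-frequency band of each channel of a (C, H, W) array.

    Hypothetical helper for illustration only; `radius` (half-width of the
    square low-frequency window) and `scale` are assumed parameters.
    """
    _, h, w = features.shape

    # 2-D FFT over the spatial axes, shifted so the zero frequency sits at
    # index (h // 2, w // 2).
    spectrum = np.fft.fft2(features, axes=(-2, -1))
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))

    # Square window centred on the zero frequency marks the low band.
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    low_band = (np.abs(yy - cy) <= radius) & (np.abs(xx - cx) <= radius)

    # Multiply the low band by `scale` and leave other frequencies untouched.
    spectrum = spectrum * np.where(low_band, scale, 1.0)

    # Undo the shift and transform back; the imaginary part is numerical
    # noise for real input because the mask is symmetric about the centre.
    spectrum = np.fft.ifftshift(spectrum, axes=(-2, -1))
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 8, 8))  # stand-in for a U-Net feature map
boosted = scale_low_frequencies(feature_map, radius=1, scale=1.5)
```

With `scale=1.0` the function returns the input unchanged, and with `radius=0, scale=0.0` it removes only the per-channel mean, which gives two quick sanity checks on the masking logic.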