If you are not already aware, Sora is the latest in a series of groundbreaking models released by OpenAI, and it can create photorealistic short videos from a text prompt.
Here’s an example it produced of two ships sailing in a storm in a coffee cup:
How it works
The OpenAI research paper, Video generation models as world simulators, describes Sora as a text-conditional diffusion model that uses a transformer architecture operating on spacetime patches.
A diffusion model adds noise to its training data during the forward diffusion process, and then learns features by removing the noise in the reverse diffusion process.
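To make the forward and reverse processes concrete, here is a minimal NumPy sketch of a generic DDPM-style diffusion step. It is not Sora's actual implementation: the linear noise schedule, the step count, and the function names (make_alpha_bar, forward_diffuse, recover_x0) are assumptions chosen for illustration. In a real model a trained network would predict the noise; here the true noise stands in for that prediction.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Forward process: blend the clean data x0 with Gaussian noise at step t."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def recover_x0(x_t, t, alpha_bar, predicted_noise):
    """Reverse direction: estimate the clean data from x_t and a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((8, 8))  # stand-in for a clean image or latent

x_early, _ = forward_diffuse(x0, 10, alpha_bar, rng)     # still close to x0
x_late, noise = forward_diffuse(x0, 999, alpha_bar, rng)  # almost pure noise

# With a perfect noise prediction, the clean data can be recovered exactly.
x0_hat = recover_x0(x_late, 999, alpha_bar, noise)
```

The training objective in this kind of setup is usually to make the network's noise prediction match the noise that was actually added, which is what forces it to learn the structure of the data.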
The transformer architecture was famously used for ChatGPT, which is a large language model (LLM). Transformers work using an “attention” mechanism, where each element looks at its surrounding context to help learn patterns.
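The attention mechanism can be sketched in a few lines. The example below is a single-head scaled dot-product self-attention in NumPy, the standard formulation from the transformer literature rather than anything specific to Sora. The token count, embedding size, and random projection matrices are assumptions for illustration; in a trained model the projections are learned.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each row of `scores` says how strongly one token relates to every other token.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores)
    # Each output token is a weighted mix of all value vectors (its "context").
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 16))  # 5 tokens, 16 features each
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))
output, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of weights sums to 1, so every output token is a weighted average of the whole sequence. That is the sense in which a token "looks at" its surrounding context.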
A spacetime patch is a small block cut from a compressed representation of a video in a lower-dimensional latent space. It is called a spacetime patch because the compression covers both spatial and temporal information.
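As a rough illustration of what cutting a video into spacetime patches could look like, the sketch below splits a latent video array of shape (frames, height, width, channels) into blocks that each span a few frames and a small spatial window, then flattens every block into one token. The latent shape, the patch sizes, and the function name to_spacetime_patches are assumptions; OpenAI has not published Sora's exact patch dimensions.

```python
import numpy as np

def to_spacetime_patches(latent, pt, ph, pw):
    """Split a (T, H, W, C) latent video into flattened spacetime patches."""
    t, h, w, c = latent.shape
    if t % pt or h % ph or w % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Separate each axis into (number of blocks, size within block).
    blocks = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    # Bring the block indices to the front and the within-block axes to the back.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch, ready to be fed to a transformer as a token.
    return blocks.reshape(-1, pt * ph * pw * c)

# Hypothetical latent video: 8 frames of 16x16 with 4 channels.
latent = np.arange(8 * 16 * 16 * 4, dtype=np.float32).reshape(8, 16, 16, 4)
patches = to_spacetime_patches(latent, pt=2, ph=4, pw=4)
print(patches.shape)  # (64, 128)
```

Because every patch covers both a time span and a spatial window, the resulting sequence of tokens can be processed by a transformer in much the same way as a sequence of words.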
To get the final video output, a decoder model is also trained, which maps the latent representations back to pixel space.
Emergent capabilities
On the surface, one may assume that Sora is simply mapping pixels from one frame to the next. In a sense it is doing this, and much more.
In the example video of the two ships in a coffee cup, notice how naturally the waves form and how the ships move in a way that seems intuitive. This suggests that Sora has learned something about real-world physics, which traditionally was hand coded using physics equations.
Coherence and permanence are also among Sora’s achievements. When producing a shot with a moving camera, objects do not simply disappear or randomly change position when they come back into frame.
Combining these capabilities, it is conceivable that Sora not only lays the groundwork for blockbuster movies but may also be used to create virtual game worlds.
Although Sora is currently open to only a few people and still struggles with occasional glitches, it is exciting to think about where we will be a few iterations down the line.