Cursor for video editing doesn't exist for a reason.
Since I've been tagged in this post multiple times and we have first-mover advantage in the space, I felt obligated to share our findings and explain why video editing isn't on the same level as code editing (yet):
- There is no "VS Code for video editing." There's no open-source professional video editing UI that fits this use case, so you first need to build an AI-friendly editor before anything else. Adding AI on top of that is the easy part.
- Building NLEs is VERY VERY hard. If you're building one from scratch, you'll spend at least 2-3 years (full time) on table-stakes features, and there are very few ways to make money before that. On top of that, you're operating in one of the hardest problem spaces in software engineering. I've seen countless video editing startups fail due to technical complexity since I started this company in 2023.
- You'll have to master delayed gratification. With 2-3 years just to build the foundation, you'll have to say no to a lot of temptations, like building a VS Code fork instead, which can generate serious revenue within a few months. And once it becomes plausible that you can succeed at building a competitive NLE, you'll get a flood of job offers from well-funded startups and corporations that you need to resist. You have to genuinely care about video processing and be intrinsically motivated by something other than "I want to make money fast."
- There is no Stack Overflow or GitHub for video editing. You have to teach LLMs a lot of custom constructs, whereas they already have billions of lines of general code in their training data.
- Multimodal requirements are a huge challenge. A coding agent only deals with text in and text out. A video editing agent needs to handle audio, video and images at the same time, which is far harder to process and much more expensive.
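To put a rough number on that cost gap, here's a back-of-envelope sketch. The per-frame token count, sampling rate, and words-per-minute figures are illustrative assumptions, not vendor pricing:

```python
# Rough comparison: tokens needed for a text transcript of a 10-minute video
# vs. the same video sampled as image frames for a vision-language model.
# All constants below are illustrative assumptions.

TOKENS_PER_FRAME = 255   # assumed cost of one low-res frame in a VLM context
WORDS_PER_MINUTE = 150   # typical spoken-dialogue pace
TOKENS_PER_WORD = 1.3    # common rule of thumb for English text
FRAMES_PER_SECOND = 1    # sparse sampling; real editing needs far more

video_minutes = 10
text_tokens = int(video_minutes * WORDS_PER_MINUTE * TOKENS_PER_WORD)
video_tokens = video_minutes * 60 * FRAMES_PER_SECOND * TOKENS_PER_FRAME

print(f"transcript:    ~{text_tokens:,} tokens")
print(f"sampled video: ~{video_tokens:,} tokens")
print(f"ratio:         ~{video_tokens / text_tokens:.0f}x")
```

Even at a sparse 1 frame per second, the video context is roughly 78x larger than the transcript alone, before any audio is accounted for.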
- Chat is not the best UX for video editing. Often it’s faster to just make cuts on the timeline than type instructions into a chat box and wait for the result, especially when the LLM makes lots of mistakes that you then have to fix. Our view is that it’s better to have AI actions you can trigger with a button than to describe everything in a prompt.
- Videos require far more bandwidth than text. Uploading footage to the cloud can take hours, while text is instantly accessible anywhere. We experimented with local models (FastVLM), but that path is a dead end due to context window limits. You’re better off with an upload flow.
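The "hours to upload" claim is just transfer-time arithmetic. The footage size and uplink speed below are illustrative assumptions (a modest day of footage on a decent residential connection):

```python
# Simple transfer-time arithmetic showing why footage uploads take hours.
# footage_gb and uplink_mbps are illustrative assumptions, not measurements.

footage_gb = 100     # e.g. a day of camera footage or proxies
uplink_mbps = 50     # a reasonable residential upload speed

footage_megabits = footage_gb * 8 * 1000   # GB -> megabits (decimal units)
hours = footage_megabits / uplink_mbps / 3600

print(f"{footage_gb} GB at {uplink_mbps} Mbps ≈ {hours:.1f} hours")
```

A one-sentence prompt, by contrast, is a few hundred bytes and arrives effectively instantly; that asymmetry shapes the whole product.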
At Diffusion we don’t see a chat sidebar as our core advantage, but as a helpful feature in specific situations. We’d rather invest the majority of our time into building the best NLE UX than trying to automate everything.