CHANGELOG MAY 09, 2026 5 min

SkyReels V4 lands in the catalog: video with sound from text or image

A new video AI from Skywork. Describe a scene or upload an image and you get a finished clip up to 15 seconds long, with sound and voices baked in. Live on NetRoom.


EDITORIAL, NETROOM

What's new

The NetRoom catalog now includes SkyReels V4, a video AI from Skywork. It's a single model that covers most common video tasks: building a clip from a written description, animating a still image, changing a finished cut, extending a clip. The headline feature is that it generates its own sound: voices, room tone, music and effects.

What it can do

SkyReels V4 bundles six modes that used to require separate models, all available from a single window.

1. Video from a description

Describe the shot — characters, setting, motion, mood — and the model builds the clip from scratch. Great for fast concepts, client mockups or short social posts when you don't have any source footage on hand.

2. Animating a still

Upload one, two or three of your own images and the model connects them with motion. You can tell it which image is the opening frame, which is the closing one, which sits in the middle — it draws how the camera and the scene get from one to the next. Useful when the art director already has the closing shot and you need to show how the camera lands on it.

3. Reworking a finished clip

Pass in your video plus a description of what should change. The model can swap the wardrobe, change the weather, or transfer one subject's motion onto another, all while keeping the source composition. The original audio stays in place.

4. Extending a clip

When the client comes back with "add another five seconds", the model continues the tail in the same style and rhythm. The join is seamless, with no visible cuts.

5. Local edits

In a finished frame you can change one area — the background through a window, a sign on the wall, the color of a car, the outfit on a character — and leave the rest of the shot untouched. No manual repainting.

6. Voice-driven lip-sync

Hand the model a voice sample up to 15 seconds long and it matches the character's mouth movement to the audio. Useful for avatars, spokesperson ads and any scene with dialogue.

Sound included

Flip one switch and SkyReels V4 generates a soundtrack alongside the video — footsteps on a floor, fabric rustle, voices, breaking waves, city ambience, mood-fitting music. Everything is synced to what's happening on screen. For most drafts and mid-tier finals, a separate sound pass is no longer required.

Tips for better results

Describe the scene in detail: what's in the frame, how the camera moves, what the lighting looks like, the overall mood. More detail means a closer match.
For a specific character or location, attach an image. The model preserves the face, the costume, the texture of the place.
Shorter clips (5–8 seconds) tend to land more reliably — there's less room for the story to drift toward the end.
If something turns out wrong, don't argue with "no hat" or "no beard": the model reads positive instructions much better than negatives. Rephrase to describe what you do want to see, for example "clean-shaven" instead of "no beard".

Specs

Resolution and frame rate: up to 1080p at 32 frames per second.
Clip length: 3 to 15 seconds.
Aspect ratios: 16:9 horizontal (YouTube, banners), 9:16 vertical (TikTok, stories), 4:3 classic, 3:4 portrait, 1:1 square (Instagram feed).
Output formats: MP4, WEBM, MOV.
Up to four different takes of the same scene per run — pick whichever lands closest.
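
If you plan to script runs, it can help to check a request against the limits above before sending it. The sketch below is only an illustration: ClipRequest, check_request and frame_count are hypothetical names, not part of the NetRoom API, and the limits are copied from the list above. It also assumes a clip is rendered at the full 32 frames per second.

```python
from dataclasses import dataclass

# Limits copied from the Specs list; everything else here is illustrative.
MIN_SECONDS, MAX_SECONDS = 3, 15
FPS = 32
ASPECT_RATIOS = {"16:9", "9:16", "4:3", "3:4", "1:1"}
OUTPUT_FORMATS = {"MP4", "WEBM", "MOV"}
MAX_TAKES = 4


@dataclass
class ClipRequest:
    # Hypothetical field names, not taken from the NetRoom API.
    seconds: float
    aspect_ratio: str = "16:9"
    output_format: str = "MP4"
    takes: int = 1


def check_request(req: ClipRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the specs."""
    problems = []
    if not MIN_SECONDS <= req.seconds <= MAX_SECONDS:
        problems.append(f"clip length must be {MIN_SECONDS} to {MAX_SECONDS} seconds")
    if req.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {req.aspect_ratio}")
    if req.output_format.upper() not in OUTPUT_FORMATS:
        problems.append(f"unsupported output format: {req.output_format}")
    if not 1 <= req.takes <= MAX_TAKES:
        problems.append(f"takes must be between 1 and {MAX_TAKES}")
    return problems


def frame_count(seconds: float) -> int:
    """Frames in a clip, assuming the full 32 frames per second."""
    return round(seconds * FPS)
```

Under that assumption, the shortest clip (3 seconds) comes to 96 frames and the longest (15 seconds) to 480.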

Where to try it

Model page: /model/skyreels-v4/. You can run it from chat, from the catalog or through the API. The mode is selected automatically based on what you send: text alone, text with an image, or text with a video.
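
The automatic selection described above can be pictured as a simple rule on the inputs. This is a rough sketch of that rule, not NetRoom's actual implementation: pick_mode and its return labels are hypothetical, and two behaviors are assumptions the changelog does not confirm (that a text description is always required, and that sending an image and a video together is rejected).

```python
from typing import Optional


def pick_mode(prompt: str, image: Optional[str] = None, video: Optional[str] = None) -> str:
    """Choose an input family from what the caller sends.

    The three return labels are illustrative, not NetRoom identifiers.
    """
    if not prompt.strip():
        # Assumption: every listed combination includes text.
        raise ValueError("a text description is needed in every case")
    if image and video:
        # The changelog does not cover this case, so the sketch refuses to guess.
        raise ValueError("send either an image or a video, not both")
    if video:
        return "text+video"
    if image:
        return "text+image"
    return "text-only"
```

For example, pick_mode("a fox runs through snow") falls into the text-only family, while passing image="fox.png" alongside the same prompt selects the text-with-image family.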