
Optional 'progressive' download frame #102

Closed
jakearchibald opened this issue Sep 9, 2020 · 57 comments

Comments

@jakearchibald

jakearchibald commented Sep 9, 2020

The lack of progressive rendering seems like a missing feature in AVIF. Sure, it's possible to do a top-to-bottom decode, but that isn't as good as multi-pass, and it sounds like the implementation is progressive.

Could another image be added to the start of the file, marked up in some way to indicate that it's optional for the browser to render it? Although it should render it if it doesn't have the rest of the file yet.

This part could be a different resolution to the main image, be encoded at a different quality, or be visually different (e.g. blurred).

For example, the cat image I used in the demo is 96 kB for full quality, but an acceptable preview can be done in 5 kB by halving the resolution and turning the quantizers to max.

I don't think this is the same as a thumbnail, since a blurry or low-quality version would be unacceptable there.

@wellington1993

I think that idea could be an essential step toward making the format popular for Web usage.
And it could be "optional", not affecting other uses of the format.

Nice suggestion! @jakearchibald

@wantehchang
Collaborator

Encoding AVIF images for progressive rendering is being discussed in the AOM Still Picture Subgroup. It seems that both the HEIF container format and the AV1 bitstream format have the necessary features (HEIF's multi-layer images and AV1's spatial layers) to support this, but someone needs to work out the details and generate a multi-layer AVIF image as a proof of concept. Leo Barnes reported his findings in an email on this topic last week. I don't know HEIF well enough to understand Leo's email, but it looks like he figured out the HEIF part of the puzzle. So we are making progress!

@kornelski

kornelski commented Sep 9, 2020

I'd like to note that Cloudflare is interested in this feature. We have an HTTP/2 server that understands the structure of image formats and multiplexes network streams in order to deliver progressive previews of all images in parallel as soon as possible. We already have an AVIF converter, and would like to implement progressive encoding.

@jakearchibald
Author

I created a demo by loading the 5 kB 'preview' version, then changing the src to the full version: https://jakearchibald.com/2020/avif-has-landed/#progressive-avif-demo.

Here's the full demo, although it's Chrome-only: https://jakearchibald.com/2020/avif-has-landed/demos/loading/
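In code terms, the swap in that demo boils down to roughly the sketch below (a sketch only, not the demo's actual code; the element id and file names are placeholders):

```typescript
// Rough sketch of the two-file trick: paint a tiny preview AVIF immediately,
// then swap in the full-quality file once it has downloaded and decoded.
// The element id and file names are placeholders, not the demo's real assets.
function showWithPreview(img: HTMLImageElement, previewUrl: string, fullUrl: string): void {
  img.src = previewUrl;        // ~5 kB preview paints almost immediately
  const full = new Image();
  full.src = fullUrl;          // full-quality version downloads in parallel
  // decode() resolves once the full image is ready, so the swap never shows a
  // half-decoded frame; fall back to a plain swap if decode() rejects.
  full.decode().then(() => { img.src = fullUrl; },
                     () => { img.src = fullUrl; });
}

showWithPreview(
  document.querySelector('img#cat') as HTMLImageElement,
  'cat-preview.avif',   // the 5 kB half-resolution, max-quantizer version
  'cat-full.avif',      // the 96 kB full-quality version
);
```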

@paperboyo

Just a comment from a mere user: while I think it’s a great suggestion (and the demo clearly shows the benefit, thanks!), I think ideally it would be properly* included in the standard (maybe in AV2, if impossible for AV1?).

While including an extra image to be used as a "fake" progressive has its benefits, it does increase the overall file size and presents its own problems: if the make-up of this extra image (resolution reduction ratio relative to the original, extra compression, blurriness, etc.) is left entirely for users to decide, there are questions around best practice for what that make-up should be (an extra 5 kB may be OK for a 100 kB image, but not so much for a 20 kB image). Also, it may bring some visual chaos to the web that will be hard to come back from (not a fan of the plethora of placeholder image techniques, myself): what would prevent users from sticking a company logo in as this extra frame?

*as I understand it (and I don't!), "proper" progressive would use the original image in a progression of standard (although tweakable) renders and wouldn't use an extra frame. This way, connection quality naturally eliminates (perceptually) the intermediate stage, while in the "fake" progressive there would be a flash of this extra, ultimately unneeded image (?). I hope the video-first nature of the AV standards doesn't prevent them from becoming properly progressive one day.

@kornelski

@paperboyo progressive is impossible in AV1. It's too soon to predict how adoption and upgrade paths will look for AV2 and AVIF. For example, WebP remained stuck with the old VP8 despite the world moving on to VP9, so it's possible that AVIF, at least on the Web, will live and die with AV1.

Progressive rendering is cheap in the old JPEG, because JPEG doesn't have prediction modes and filters that depend on exact pixels of neighboring blocks. AV1 (and all modern video codecs) has such dependencies. Removing compression features that prevent progressive rendering could make compression worse, and overall it may not be better than adding an extra fake image.

@kornelski

@wantehchang It looks like the mailing lists are closed and only members (presumably big corporations) can join? As an individual implementer, can I participate in these discussions?

@paperboyo

Boo, thanks for the explanation, @kornelski. Then, as a user, I vote for users to have as little control over the make-up of this extra frame as (in)humanly possible; let's just decide this in the standard and not give encoders any "creative" control over it.

It would be rigid, not take connection quality into account and so on, but still better than having to write tools to strip the oversized extra frame, tools to replace some company logos with your own, tools to remove company logos from images one got from the web, tools to update the extra frame other tool forgot to update upon resaving, tools to stick multiple AVIFs with exactly the same main image but differently-sized extra frames into a srcset to be served depending on network conditions and so on. And better than having to write (and read) all the blog posts about how to use them tools and what’s best practice with this extra frame, instead of having to read (and write) blog posts about the art and science of optimising the proper images.

Or, maybe, I’m wrong. Maybe I’m unnecessarily pessimistic. I, for one, wouldn’t abuse them, no way! 😜

@verlok

verlok commented Sep 10, 2020

Good idea!
If it's not feasible in the AV1 file format, we could just use the small 5 kB file as a placeholder while we load the full image, just like @jakearchibald did in his demo.
BUT it would be better if it could be embedded in the file format itself.

@wantehchang
Collaborator

@wantehchang It looks like the mailing lists are closed and only members (presumably big corporations) can join? As an individual implementer, can I participate in these discussions?

Hi Kornel,

Yes, only employees of AOM members can join the mailing lists of AOM working groups. I think companies of all sizes can join AOM, but the details don't seem to be published. If you work for a company, you can have the company fill out this application form to find out more: http://aomedia.org/membership/

One thing I can do is to ask Leo Barnes to post his progress in this issue.

@leo-barnes
Collaborator

Hi,
I'll try to put together a summary of the discussion during/after the weekend. There are a bunch of different use-cases for progressive decoding and a bunch of options available depending on which tradeoffs you want to make.

@leo-barnes
Collaborator

This is going to be a pretty long comment given that this is still under very active discussion. The conclusion so far is that the HEIF/MIAF/AVIF specifications, as well as AV1, have pretty much everything in them that is needed. The missing parts are mainly clarifications on exactly how it should work and recommendations/best practices on how files should be structured, as well as how parsers and decoders should interpret them.

Use-cases

To start with, it's helpful to outline the two scenarios where progressive decoding is currently in use for JPEGs:

  1. When downloading an image with the intent of displaying it at full resolution and wanting to show progressive updates if the connection is slow.
  2. When decoding an image at less than full resolution, which may make it possible to avoid loading the full file. This is a more general use-case and may happen both while downloading an image and when it's already downloaded.

Personally I don't consider the first scenario (which I'll call S1) to be extremely important these days. You can get more or less the same functionality by simply displaying the thumbnail until the full image has been downloaded.

Worth noting is that there has been very little scientific study on whether people actually prefer progressive or sequential decoding over slow connections. Facebook has run multiple internal experiments over the years to answer the question, but the conclusions were kind of mixed. In the cases where users preferred progressive rendering it couldn't be determined if this was actually due to the progressive updates or simply due to the progressive images being smaller in size and therefore downloading faster.

S1 is only really an issue when you have very slow download speeds. This is still the case in some places in the world, but is becoming less and less of an issue. Added to that, most monitors are not actually very high resolution. Some googling seems to indicate that most monitors tend to have 1920x1080 pixels or less. So only small to medium sized images will ever really be displayed at full resolution. For small images you will probably not really notice progressive updates. If you immediately want to show something on screen, you can use the thumbnail and simply display the full image once it's fully downloaded.

This leaves medium sized (typically smaller than 1080p) images as the main scenario where progressive updates would potentially be noticeable on slow connections. Here you can again show the thumbnail on screen until the full image has been decoded, but it may be low resolution enough that it won't look very pretty. But neither will only the first scans of a progressive JPEG.

Which brings us to the second scenario (S2). This is the case any time you want to display a large image on a small screen (or part of a screen). For typical camera images, this is almost always the case unless the user has a 4K display (or unless the user has zoomed in, in which case you only want to download/decode that region of the image rather than the full image). For a view that is showing a wall of pictures it will almost be guaranteed to be S2 unless the content provider has downscaled thumbnails that are shown instead.

With progressive JPEGs, you would in this case usually not need all the scans in the file. When downloading, this is very beneficial since you don't need to download the full file. When in local storage, decoding is faster and uses less memory.

The reason why it's important to outline S1 and S2 is that there are multiple ways of accomplishing progressive decoding. Some of them are better for S1, and some of them are better for S2.

Solutions

With the use-cases out of the way, how would we actually do progressive decoding in AVIF? There are basically three ways:

  1. Use thumbnails.
    This is very simple and something that MIAF (on which AVIF is based) recommends for all main images in a file. Thumbnails solve S1 reasonably well in that something can typically be displayed very quickly. The main drawbacks of thumbnails are:

    • The thumbnail may be too small to be useful for S2.
    • There is some coding redundancy since the thumbnail is not used for prediction in the main image.
    • Thumbnails are assumed to be lower resolution than the main image. You can't (or are not really supposed to) have thumbnails that are the same resolution but lower quality.
  2. Use independent images in an 'altr' group.
    This is basically exactly what jakearchibald is asking for. You have two items in the file. The first is the lower quality and/or lower resolution image, the second is the full quality image. This can work for both S1 and S2.
    Both items are placed in an 'altr' group with the (full quality) item placed first in the list. Decoders are supposed to pick the first item in the 'altr' group that they can decode and that matches their requirements. If the first item in the list has not yet been downloaded but the second has, the decoder is supposed to use that instead. Which in this case means that the lower quality item is used until the full quality item is available.
    The main drawback of this method is that the different items in the 'altr' group are independent which leads to some pretty severe coding redundancy if the lower quality/resolution items are large.
    This was also discussed with browser developers when MIAF was initially written. The conclusion from those teams was that they generally preferred using separate files rather than having multiple independent assets in the same file.

  3. Multi-layer coding (but no extra info on the container level)
    AV1 supports multi-layer coding both in the form of scalable quality and scalable resolution. This means you can create an elementary stream that gives you progressive refinement either in the form of layers of progressively better quality, or layers of progressively larger resolution. If such a stream is placed in an AVIF container, you would be able to satisfy S1 in much the same way that progressive JPEGs do.
    But since no extra information has been added to the container, it doesn't know which parts of the elementary stream contains which layers or which potential alternative representations are available. This makes it relatively tricky to satisfy S2. It can probably be done for a single non-derived image item if the AVIF parser can also parse and understand the properties of the AV1 elementary stream, but it's going to be very non-trivial for grids. The container also doesn't know up front which parts of the elementary stream to download and would instead have to sequentially download and parse in chunks until it has enough data.
    The main drawbacks of multi-layer coding are:

    • Coding complexity. Compared to a single layer, it's going to take more cycles to decode a multi-layer stream.
    • For other codecs, multi-layer support is pretty limited. It's normative in AV1 and existing software implementations support it so this should not be an issue.
  4. Multi-layer coding with all/some layers exposed on the container level
    HEIF has the 'lsel' (LayerSelector) property which allows you to create individual items for the possible outputs of a multi-layer stream. This means that you can create an explicit 'altr' group containing the different layers in an elementary stream. The parser/decoder decodes the layers that are available.
    The benefits of exposing the layers on the container level are:

    • The layer locations are specified in the container, so the parser knows exactly which parts of the file need to be downloaded. This is great for S2.
    • This works well for derived items in the form of grids. One grid can be created of the first layer of all tiles, another of the second layer. Both grids are placed in an 'altr' group. The parser can easily reason about which layers and tiles are available without having to parse any of the individual AV1 streams.
      The main drawback compared to 3 is that the AVIF container writer needs to be a bit more complicated.

The main part missing from 4 is exactly how 'lsel' is supposed to interact with AV1. @cconcolato has opened issue #103 to track that.

I personally think 4 is the best way of accomplishing progressive refinement since it allows for more use cases. The container parser should ideally not need to be able to parse codec elementary streams to reason about a file. I would prefer a slightly more complicated AVIF writer to a much more complicated AVIF reader.
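To make the 'altr' selection rule from options 2 and 4 concrete, here is a hedged sketch of the decoder-side logic; the types and the availability check are illustrative placeholders, not a real HEIF parser API:

```typescript
// Illustrative sketch of the 'altr' selection rule: walk the group in listed
// order and take the first item the decoder supports and already has bytes for.
// The types and availability flag are placeholders, not a real HEIF parser API.
interface AltrItem {
  id: number;
  codec: string;            // e.g. 'av01'
  bytesAvailable: boolean;  // have this item's extents been downloaded yet?
}

function pickFromAltrGroup(
  group: AltrItem[],                      // ordered: full-quality item first
  canDecode: (codec: string) => boolean,
): AltrItem | undefined {
  return group.find(item => canDecode(item.codec) && item.bytesAvailable);
}

// With the full-quality item listed first, the preview item is only chosen
// while the full-quality bytes are still missing.
```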

@jakearchibald
Author

jakearchibald commented Sep 17, 2020

Thanks for the update!

S1 is only really an issue when you have very slow download speeds. This is still the case in some places in the world, but is becoming less and less of an issue.

I think there's still a benefit at 3g-like speeds. Back when commuting was a thing, I'd frequently end up on the train's wifi, which was terrible, or on mobile, which was spotty/slow during the commute. This was a train to London, so I don't think this is just an emerging markets issue.

The conclusion from those teams was that they generally preferred using separate files rather than having multiple independent assets in the same file.

I'm not sure how that works on the web. If I ask the server for two files, the 'preview' version and the full version, the server would need to be smart enough to know they're part of the same overall image and send them in sequence. Otherwise, you can end up receiving the preview after the full version. Or, the server might chunk them and send them in parallel, but bandwidth spent on the full version rather than the preview is kind of a waste. Servers are pretty bad at serving content in the most optimal order today.

The client could avoid this by requesting each in sequence, but that creates a delay as it has to create a separate request once it's received the preview.

With a single file, you don't have the sequencing issues, since the order of bytes in the file dictates that. It also means the browser can make decisions like "I don't need to decode the preview, because I already have the whole file".

The layer locations are specified in the container, so the parser knows exactly which parts of the file need to be downloaded.

If I've understood this correctly, I worry about the performance. It means the browser would have to:

  1. Start downloading the image.
  2. See it's an AVIF, start parsing.
  3. Realise the bit of the image it needs is elsewhere in the file.
  4. Terminate the stream.
  5. Make range request for the bit of the image it does need.

I think in many cases this would be slower than downloading the whole file.

For the web to benefit from it, I think the data needs to be delivered in a single response, where the data is ordered lowest-resolution/quality first.

@cconcolato
Collaborator

The conclusion from those teams was that they generally preferred using separate files rather than having multiple independent assets in the same file.

I'm not sure how that works on the web. If I ask the server for two files, the 'preview' version and the full version, the server would need to be smart enough to know they're part of the same overall image and send them in sequence. Otherwise, you can end up receiving the preview after the full version. Or, the server might chunk them and send them in parallel, but bandwidth spent on the full version rather than the preview is kind of a waste. Servers are pretty bad at serving content in the most optimal order today.

I think Leo was referring to this part of the MIAF specification:

Master image items that are not the primary item and not required for generating the output image of the
primary item may be present, but should be present only if it is expected that the application environment
could make use of multi-branded files efficiently. It is recommended that alternative coding formats be
represented by a choice of image file in a higher-level context (e.g. the picture element in HTML5).

I don't think it was meant for progressive decoding.

Anyway, to avoid any confusion, we're looking at clarifying how to do 3 and/or 4. In both cases, we're talking about having a single file. In both cases, the bits can be ordered such that you don't have sequencing issues. The differences are that:

  • in 3, writers would be simple (just copy a layered image into an item), clients would be simple, but limited (one download until you have enough).
  • in 4, writers are a bit more complex (they have to parse the layered image and signal layers at the container level), clients can still do the simple case but can use that information in cases where the client wants to do multiple byte-range requests to fetch the data (e.g. displaying only the high quality in a part of a large panorama or NASA image).

For the web to benefit from it, I think the data needs to be delivered in a single response, where the data is ordered lowest-resolution/quality first.

Then you would simply rely on (3) or the basic feature of (4).

@leo-barnes
Collaborator

Thanks for the update!

S1 is only really an issue when you have very slow download speeds. This is still the case in some places in the world, but is becoming less and less of an issue.

I think there's still a benefit at 3g-like speeds. Back when commuting was a thing, I'd frequently end up on the train's wifi, which was terrible, or on mobile, which was spotty/slow during the commute. This was a train to London, so I don't think this is just an emerging markets issue.

Absolutely. But is progressive quality refinement better than initially showing the thumbnail and then either waiting until you have the rest of the image or sequentially filling in more quality in raster-scan order?

The research done so far is very inconclusive. There has even been a study suggesting progressive refinement is actually worse, since it's very hard to tell when the image is fully downloaded/decoded.

I would actually say that the thing to optimize for in this case is smaller files, or if possible, make sure to download only as much of the file as needed.

The conclusion from those teams was that they generally preferred using separate files rather than having multiple independent assets in the same file.

I'm not sure how that works on the web. If I ask the server for two files, the 'preview' version and the full version, the server would need to be smart enough to know they're part of the same overall image and send them in sequence. Otherwise, you can end up receiving the preview after the full version. Or, the server might chunk them and send them in parallel, but bandwidth spent on the full version rather than the preview is kinda a waste. Servers are pretty bad at serving content in the most optimal order today.

The client could avoid this by requesting each in sequence, but that creates a delay as it has to create a separate request once it's received the preview.

With a single file, you don't have the sequencing issues, since the order of bytes in the file dictates that. It also means the browser can make decisions like "I don't need to decode the preview, because I already have the whole file".

I'm neither a web developer nor a browser developer, so I can't really give you very much detail, unfortunately. I think what they were saying was that if a server has multiple versions of an asset with different sizes, there are ways of specifying this in HTML so that the browser knows there are multiple choices and automatically downloads the best asset for the given device/monitor. They said this was preferable to having multiple assets in the same file, since the browser would then have no other choice than to download everything.

The layer locations are specified in the container, so the parser knows exactly which parts of the file need to be downloaded.

If I've understood this correctly, I worry about the performance. It means the browser would have to:

  1. Start downloading the image.
  2. See it's an AVIF, start parsing.

Doesn't the browser know the MIME type of the asset up front?

  3. Realise the bit of the image it needs is elsewhere in the file.
  4. Terminate the stream.
  5. Make range request for the bit of the image it does need.

I think in many cases this would be slower than downloading the whole file.

There is nothing stopping you from downloading the full file. In fact, I definitely think you should lay out the layers and tiles in the file so that you get progressively better quality as you download sequentially. But exposing the layers in the container gives you the choice of easily stopping once you have good enough quality.

For the web to benefit from it, I think the data needs to be delivered in a single response, where the data is ordered lowest-resolution/quality first.

There are implementations of progressive decodes for progressive JPEGs that do exactly what I'm talking about already. They download the first N scans of the image and use that when the image is shown in a grid/thumb view, but once a user taps the image the rest of the scans for that file are downloaded so it can be viewed in full quality. The benefit of AVIF over progressive JPEG here is that with AVIF you would know once you've parsed the 'meta' box exactly how much of the file you need to download for the different viewing modes.
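A sketch of what that could look like, assuming the 'meta' box has already been parsed and the byte offset at which each layer ends is known (the offsets in the example are made up):

```typescript
// Sketch of "download only as much as you need": the container is assumed to
// declare, per layer, the file offset at which that layer's data ends.
async function fetchUpToLayer(url: string, layerEnds: number[], layer: number): Promise<Uint8Array> {
  const end = layerEnds[Math.min(layer, layerEnds.length - 1)];
  // Range request: everything from the start of the file up to the chosen layer.
  const res = await fetch(url, { headers: { Range: `bytes=0-${end - 1}` } });
  return new Uint8Array(await res.arrayBuffer());
}

// Grid/thumb view: first layer only; once the user taps the image, fetch the rest.
// The offsets below are invented for illustration.
// await fetchUpToLayer('photo.avif', [18_000, 64_000, 210_000], 0);
```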

@kornelski

kornelski commented Sep 17, 2020

@leo-barnes Thank you for the information. I wasn't aware that AV1 has an option of multi-layer coding. I'll have a look at this.

I would like to contest your assessment about the S1 use-case. At Cloudflare we're seeing measurable improvement from progressive rendering, in both synthetic benchmarks like SpeedIndex, as well as real-world engagement, even correcting for file size difference in JPEG.

If you assume images load progressively, but strictly one image after another, then indeed it would be hard to notice an improvement, because the connection would have to be so extremely slow that even a single image takes noticeable time to load.

However, the difference is that on the web multiple images load in parallel. In HTTP/2 we even take extra care to multiplex their streams on the network level to increase this parallelism. Because of multiplexing across all images on the page, we can get useful progressive rendering even from small images that are only a few kB. Images use more than half of an average website's bandwidth, and we can make them appear after only about 12% of an image's data is loaded, so visually it's like improving connection speed by over 40%.

https://blog.cloudflare.com/parallel-streaming-of-progressive-images/

What we do is something like this:

  1. Send exactly the first 177 bytes of every JPEG file to the browser (so that it knows dimensions)
  2. Send CSS to the browser (so together with image dimensions it can do layout)
  3. Then send DC scans of all JPEG images on the page to the browser (it can paint all images now!)
  4. Then send JavaScript and other resources.
  5. Then send the rest of images, all in parallel.

This means there can be a delay between steps 3 and 5 while the browser is busy downloading and executing JavaScript. Step 5 is also parallelized, so progressive rendering is used for the entire duration of the page load.
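Purely to illustrate that ordering (this is not Cloudflare's actual implementation), the priority scheme looks roughly like this:

```typescript
// Toy illustration of the send order described above: image headers for layout,
// then CSS, then the low-detail scans so every image can paint, then scripts,
// then the remaining image data. Not Cloudflare's actual code.
interface ImageChunks {
  header: Uint8Array;     // e.g. the first 177 bytes of a JPEG
  lowDetail: Uint8Array;  // DC scans (or a first AVIF layer)
  rest: Uint8Array;       // everything else
}

function* sendOrder(images: ImageChunks[], css: Uint8Array, js: Uint8Array): Generator<Uint8Array> {
  for (const img of images) yield img.header;     // 1. dimensions for layout
  yield css;                                      // 2. page layout can happen
  for (const img of images) yield img.lowDetail;  // 3. all images can paint
  yield js;                                       // 4. scripts and other resources
  for (const img of images) yield img.rest;       // 5. full detail (a real server
                                                  //    interleaves these streams)
}
```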

@cconcolato
Collaborator

@kornelski your algorithm for AVIF would translate into:

  1. Send all the bytes up to the end of the meta box
  2. (unchanged)
  3. Send the first layer of the image
  4. (unchanged)
  5. (unchanged)

Correct?

For Step 3, I'm assuming you are parsing the JPEG to determine the end of the DC scans. Are you willing to parse the AV1 bitstream to know where the first layer ends? If not, would you favor an approach with multiple items (one per layer)? or would it be sufficient if there was only 1 item but with one extent per layer?

@leo-barnes
Collaborator

However, the difference is that on the web multiple images load in parallel. In HTTP/2 we even take extra care to multiplex their streams on the network level to increase this parallelism. Because of multiplexing across all images on the page, we can get useful progressive rendering even from small images that are only a few kB. Images use more than half of an average website's bandwidth, and we can make them appear after only about 12% of an image's data is loaded, so visually it's like improving connection speed by over 40%.

https://blog.cloudflare.com/parallel-streaming-of-progressive-images/

Thanks, I'll read the link. My point, though, is that you don't really need progressive decoding to do the same with AVIF; you could just start using the thumbnails in the files (which the spec recommends should be placed before the main image data). The thumbnail will most likely allow you to show a full low-quality picture with less than 12% of the data.

Progressive JPEGs have the benefit that they usually allow for smaller files than baseline JPEGs, since they allow separate optimized Huffman tables for each scan. The same is generally not true of multi-layer coding with other codecs. There you will usually get slightly larger sizes, if I've understood it correctly. Adding a thumbnail also makes the file larger, of course.

So, you have three options:

  1. A single layer file with no thumbnail
  2. A single layer file with a thumbnail first
  3. A multi layer file without a thumbnail

Both 2 and 3 will end up being larger than 1. Both 2 and 3 will allow you to show something with only a subset of the data. Option 3 has higher decoding complexity. Given option 2 or 3, which one is best? I would guess the one which ends up consuming less data, but that depends on exactly how you layer the image and on the thumbnail size. But if decoding speed is crucial, option 2 would be better.

@cconcolato
Collaborator

FYI, a tentative example of an AVIF file containing 2 layers is here. The stream was not encoded to demonstrate progressiveness, so don't look at the quality or size of each layer. The intent is simply to verify/have implementation support. Comments on the file should be made in issue #40.

@kornelski

kornelski commented Sep 17, 2020

Re #102 (comment) Correct.

I'm not sure yet which approach is better. I'll need to investigate how well layers work and how easy it is to parse AV1 payload.

The thumbnail approach and extents are definitely easy to implement. It'd be just a few lines of code in my current codebases.

@leo-barnes
Collaborator

@cconcolato

For Step 3, I'm assuming you are parsing the JPEG to determine the end of the DC scans. Are you willing to parse the AV1 bitstream to know where the first layer ends? If not, would you favor an approach with multiple items (one per layer)? or would it be sufficient if there was only 1 item but with one extent per layer?

The problem with only 1 item but multiple extents is that you don't actually know that each extent corresponds to a layer until you parse it. You can hope for it, but there is absolutely nothing guaranteeing it. That is why I'm pushing pretty hard for marking everything up correctly at the container level.
If you're going to put in the extra effort of doing multi-layer coding, taking the extra step of creating an easily parseable container should be nothing in comparison. But that is very much my personal opinion. 🙂

@leo-barnes
Collaborator

The problem with only 1 item but multiple extents is that you don't actually know that each extent corresponds to a layer until you parse it. You can hope for it, but there is absolutely nothing guaranteeing it. That is why I'm pushing pretty hard for marking everything up correctly at the container level.
If you're going to put in the extra effort of doing multi-layer coding, taking the extra step of creating an easily parseable container should be nothing in comparison. But that is very much my personal opinion. 🙂

The other reason why I think everything useful should be on the container level is that I don't think Cloudflare should need an AV1 parser on the server in order to efficiently serve up AVIF images. Otherwise they may very well end up having to have an AVC, HEVC, AV1 and AV2 parser just to be able to figure out what's in a file. Ideally it would be enough to have a general purpose HEIF parser.

@jakearchibald
Author

jakearchibald commented Sep 18, 2020

@leo-barnes

But is progressive quality refinement better than initially showing the thumbnail and then either waiting until you have the rest of the image or sequentially fill in more quality in raster scan order?

The proof of concept I hacked together was just a low-quality half-resolution version followed by a full version.

I guess that could be done with a thumbnail. If browsers start displaying the thumbnail before they have the full image, I'd definitely optimise my thumbnails for early-render on a web page, and publish tooling to help others to do the same. This would make them crappy thumbnails, but if you're happy with that, so am I.

I don't think scan order is as good, as @kornelski says.

I'm neither a web developer nor a browser developer, so I can't really give you very much detail, unfortunately. I think what they were saying was that if a server has multiple versions of an asset with different sizes, there are ways of specifying this in HTML so that the browser knows there are multiple choices and automatically downloads the best asset for the given device/monitor. They said this was preferable to having multiple assets in the same file, since the browser would then have no other choice than to download everything.

Ah, they're talking about responsive images. Yes, this is ideal for cases where the browser wants to pick the right file for the number of pixels needed. However, I think this is orthogonal to progressive/preview rendering.

  • Responsive images: "Of these resources, I will pick the best one for me"
  • Progressive/preview rendering: "While downloading that resource, I can represent it in a useful way without downloading the whole resource"

Both are useful at the same time.

2. See it's an AVIF, start parsing.

Doesn't the browser know the MIME type of the asset up front?

Not before making the request. It knows the type if it's specified in the response Content-Type header, however all browsers ignore this and just sniff the start of the resource instead.

I definitely think you should lay out the layers and tiles in the file so that you get progressively better quality as you download sequentially. But exposing the layers in the container gives you the choice of easily stopping once you have good enough quality.

Ah, cool, I think I got confused when you said "Both items are placed in an 'altr' group with the (full quality) item placed first in the list", and assumed that meant the full quality data would appear first in the file, but I see that the list and data order are separate.

@leo-barnes
Collaborator

@jakearchibald

Ah, cool, I think I got confused when you said "Both items are placed in an 'altr' group with the (full quality) item placed first in the list", and assumed that meant the full quality data would appear first in the file, but I see that the list and data order are separate.

Yeah, I know, trying to explain it in words is always tricky. In short, the 'altr' group tells the decoder that all of the listed items are alternative interpretations and that it should pick the first item in the list that it can decode and matches its requirements. This is completely separated from the order the items are stored in the file (which is handled by the 'iloc' box).

So you want to mark the full-quality item as the preferred item, so that is what is decoded if the decoder has all the data.

The other main use of the 'altr' group in the spec is to give alternative representations in case the decoder can't support all of them. You could mark an HDR item as the first item, but give an SDR representation as a backup in case the decoder/system can't display HDR content (or can't decode 10-bit content).

@X-Ryl669

@jakearchibald I don't think they are so different, in fact. In an ideal world, one would write a single <img> element and let the browser deal with which picture is best to fetch without wasting resources. srcset is a manual chore and a pain to maintain (if you change a picture on a website, you need to re-render all the different items/scales; if you change the main picture's displayed size in CSS, you need to update the HTML's srcset rules; and so on).

In the end, if AVIF was able to say: "Hey, I have 3 pictures in this file, one is 500px wide and it's located in [0-X] part of the file, the second is 1000px wide (and it's located in [0-X2] part of the file) and the last one is 1500px wide (and it's located in [0-X3] part of the file)", the browser could make the decision to stop downloading the file when it reaches X bytes if the current image box is only 500px wide or less.

Being a single format/file makes it impossible to miss re-rendering the subscales when editing the file. It also re-adjusts magically when the CSS modifies the image's box width.

As an added bonus, if the browser, once it has downloaded X bytes, starts to decode the 500px picture and stretch it into a 1200px-wide box, you get the progressive feature for free (no additional thumbnails to generate, no JavaScript). When the browser gets the X2'th byte, the picture quality bumps up, as it's now a 1000px image that's stretched to 1200px, and so on until the whole file is downloaded and the final quality is displayed.
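A sketch of that selection logic, with hypothetical width/byte-offset pairs standing in for whatever the file header would actually declare:

```typescript
// Sketch of picking a stop point from declared layer widths and end offsets.
// The LayerInfo shape and the numbers below are hypothetical, not a real box.
interface LayerInfo { width: number; endOffset: number; }  // sorted by ascending width

function bytesNeededFor(displayWidthPx: number, dpr: number, layers: LayerInfo[]): number {
  const target = displayWidthPx * dpr;
  // First layer wide enough for the displayed box, otherwise the full image.
  const match = layers.find(layer => layer.width >= target);
  return (match ?? layers[layers.length - 1]).endOffset;
}

// A 600px-wide box on a 2x display would need the 1500px layer here:
// bytesNeededFor(600, 2, [
//   { width: 500,  endOffset: 9_000 },
//   { width: 1000, endOffset: 30_000 },
//   { width: 1500, endOffset: 80_000 },
// ]);
```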

In your example, there is the low "quality" of the initial image (that's not just the quality loss from upscaling the thumbnail, it's a quantization issue: the quantizers were too high). I'm not sure if this is really a plus, as it's very distracting (too much detail is lost, so when the new picture pops in, it's like the 90's web <blink> issue, or FOUC).

If AV1 allows efficient storage by re-using the lower-resolution pictures for the higher-resolution pictures, it'd be a real killer feature, but it's not really required by current web standards, since most content creators are already creating multiple resolutions and hurting their heads over their always-breaking srcset rules.

@kornelski

kornelski commented Jan 13, 2021

@X-Ryl669 Such schemes have been proposed and tried many times. It's not as great as it seems.

  • Hierarchical encoding isn't very efficient (it was in the 1992 JPEG spec, and failed to get traction). The highest frequencies are expensive to encode, deltas between images at various resolutions don't help with that, and they even add extra work due to needing to undo anti-aliasing.

  • Browsers can't stop downloading a file. They can send a reset packet to the server to tell it to stop sending more data. Due to latency and bufferbloat, this often doesn't do anything, as the rest of the data is already on the wire. If you try to avoid that by having the browser request the right range in the first place, then you have a latency penalty, because the browser first needs to get an index of ranges to request. Also, at the time of requesting images, browsers may not even know what size they need. That's why image client hints are imprecise, and we've ended up with srcset/sizes.

@X-Ryl669

X-Ryl669 commented Jan 13, 2021

Hierarchical encoding isn't very efficient

Yes, I guessed that. But having all resolutions in a single file is a good thing, IMHO (even if it wastes more resources than a perfect format would); it simplifies users' work (creators, publishers and so on).

Browsers can't stop downloading a file

That's not true. A browser can stop downloading anytime. It can close the socket, it can stop reading from it (so, in effect, the TCP receive window fills up and the sender stops). In the worst case, as you said, the server will have sent a TCP window's worth of data, but it's still better than downloading the complete file (unless the large picture file is in the < 64 kB range, but in that case I'm pretty sure you don't bother using srcset anyway).

Even with srcset and sizes, your argument applies. In HTTP/2, when the browser fetches the HTML document, the pictures can already be streaming (even before the browser has parsed the HTML).

IMHO I would consider an HTTP/2 server sending pictures (those from a srcset) as a prefetched stream to be a bug, and there is nothing an image format can do to solve that.

So, on the current web, I wouldn't expect a server to send a picture if it wasn't requested explicitly. In that case, nothing prevents the browser from asking only for the header of the AVIFv2 file if every byte counts.
Currently, there is already a one-round-trip latency when using srcset, since the browser must compute the best size/file to request.

So, a format that stores in its header that there is a "stop" point at the X-th byte is just a signal telling the browser "hey, you've fetched X bytes, stop fetching now". It'll still be beneficial on a slow connection, since in that case the MTU will be low and the TCP window will be small as well.

Ideally, if we could tie all of this together, only the AVIF header would be sent/prefetched by the HTTP/2 server, there would be no srcset in the HTML, and the browser would then only range-request the parts of the file it needs. So there would be no additional latency compared to the current world, but simplified usage for all end users.

@jakearchibald
Author

That's not true. A browser can stop downloading anytime. It can close the socket, it can stop reading from it

This is inefficient in HTTP/1.1, and really really inefficient and has side-effects in HTTP/2. In HTTP/2 a single connection is used to send multiple resources multiplexed. If you close the socket you're ending many responses, and wasting time having to reestablish the connection.

@jakearchibald
Author

jakearchibald commented Apr 8, 2021

Has there been any progress/discussion on this? In case it's persuasive, here's an AVIF at 108kB (with the quality high enough to preserve most skin texture):

high-quality

But here's a 3kB preview that could go at the start (1/2 resolution, minimum quality):

preview-1

Maybe that's too artifacty for some. What if the preview could come with 'blur' metadata? Here's a 1/4-res minimum-quality version at 2 kB:

preview-blur-16 (1)

But here it is with a 16px blur:

preview-blur-16

It seems like this would be an excellent "loading" preview, if the decoder could be told "apply a 16px blur to the output", just for the preview image.

Image from https://unsplash.com/photos/idP6ct9jkmk

@jakearchibald
Author

Having said that, here's a 1/4 resolution version with the blur applied in pre-processing, also 2kB.

[image: 1/4-resolution preview with the blur applied in pre-processing]

It's kinda similar in quality & size, so maybe it doesn't need to be a post-processor (as long as the browser does decent upscaling)
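If the blur did end up as a client-side step rather than a codec feature, it could be as simple as a CSS filter while the preview is the stand-in (a sketch, not a proposed spec mechanism; the 16px radius is just the value from the example above):

```typescript
// Sketch of applying the blur client-side instead of baking it into the preview:
// blur the up-scaled preview with a CSS filter and drop the filter once the
// full-quality image is ready. Not a proposed spec mechanism.
function blurredPreview(img: HTMLImageElement, previewUrl: string, fullUrl: string): void {
  img.src = previewUrl;
  img.style.filter = 'blur(16px)';   // hides the heavy quantisation artifacts
  const full = new Image();
  full.src = fullUrl;
  full.decode().finally(() => {
    img.src = fullUrl;
    img.style.filter = '';           // show the full image unfiltered
  });
}
```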

@leo-barnes
Collaborator

You may want to take a look at this comment:
#131 (comment)

It outlines how to use the new properties that have been added to the current draft.

@paperboyo

if the decoder could be told "apply a 16px blur to the output"

&

so maybe it doesn't need to be a post-processor

In light of the arguments in my doom-laden comments above, I would vote for this metadata to be hard-coded (relative to image size?) and for the blurring to happen automagically in post-processing. That would limit user control over the make-up of these frames (e.g. discourage people from sticking a company logo or whatever in there).

@jakearchibald
Author

@leo-barnes I don't understand the terminology enough. Could you summarise this in terms of the feature it enables? Are we looking at a preview image, or a progressive decode like JPEG/JXL?

@leo-barnes
Collaborator

leo-barnes commented Apr 9, 2021

@leo-barnes I don't understand the terminology enough. Could you summarise this in terms of the feature it enables? Are we looking at a preview image, or a progressive decode like JPEG/JXL?

@jakearchibald
Progressive decode. It enables using AV1 multilayer coding. In short you would create an image item consisting of multiple layers of increasing resolution and/or quality. If the item is not marked as selecting an explicit layer, the spec clarifies that the renderer is expected to show the final layer, but may display preceding layers if it wants to do progressive refinement.
(If a specific layer is selected, this means that the layers are not suitable for progressive refinement, for example if the layers are showing different viewpoints.)

The a1lx boxes are there to help tell the renderer when it has received full layers and can potentially update the displayed image.
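As a rough sketch of how a renderer might use that information while bytes stream in (the layer sizes and the repaint step are illustrative, not taken from the spec):

```typescript
// Rough sketch: given per-layer byte sizes (as 'a1lx'-style signalling provides),
// work out how many complete layers have arrived so the renderer knows when a
// repaint is worthwhile. The sizes and the repaint step are illustrative.
function layersComplete(bytesReceived: number, layerSizes: number[]): number {
  let boundary = 0;
  let complete = 0;
  for (const size of layerSizes) {
    boundary += size;
    if (bytesReceived < boundary) break;
    complete++;
  }
  return complete;
}

// On each network chunk, repaint only if another full layer has arrived, e.g.:
// const done = layersComplete(received, [20_000, 45_000, 150_000]);
// if (done > lastPainted) { lastPainted = done; repaintWithLayers(done); }  // hypothetical repaint step
```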

@jakearchibald
Author

@leo-barnes is data from the preceding passes required to render the final pass (like progressive JPEG), or are they independent images (like what I'm proposing in this thread)?

@leo-barnes
Collaborator

@jakearchibald

@leo-barnes is data from the preceding passes required to render the final pass (like progressive JPEG), or are they independent images (like what I'm proposing in this thread)?

This is true multi-layer coding, so they are not independent images. Depending on how the stream is created you would most likely need to decode all layers to get the final output.

From what I could tell, that is what this thread was asking for. Having a lower resolution independent preview image before the main image is already very well supported in the form of thumbnails. If that is not sufficient, can you clarify what you're missing?

@jakearchibald
Author

The way I see it:

JPEG-like progressive encoding:

  • Pro: encoder can just handle it for you (hopefully).
  • Pro: Less wasted data (hopefully).
  • Pro: Multiple passes, so there's a feeling of progress during the download.
  • Con: More encoder/decoder complexity. But if folks are up for doing the work, great!
  • Con: Higher decode cost?

Preview image:

  • Pro: encoder can just handle it for you, but:
  • Pro: different tradeoffs like blur pre-processing & different resolutions could be used.
  • Pro: If the image is downloading fast enough, decoding the preview could be skipped.
  • Con: Wasted data, since the preview doesn't contribute to the final image. As such, it doesn't make sense to have more than one 'preview' pass.

If the result of slow-decoding the multi-layer version you're talking about is a smooth blurry preview like #102 (comment) (as in, not blocky, few ugly artifacts), then it's the winner!

In terms of the preview image, I thought the consensus was that a blurry preview would be an abuse of something intended for a thumbnail.

@jakearchibald
Author

I would be happy with a preview image solution fwiw.

@jakearchibald
Author

https://avif-blur-preview.glitch.me/ - I created some more demos with a post-processing blur filter

@skal65535

Thanks @jakearchibald , these are quite nice examples!

On 'city', one can see some rainbow artifacts due to the heavy compression of the chroma planes. These could be addressed with special encoding parameters dedicated to a 'preview' mode, I guess.

Overall, spending 2 kB on these 300-400 kB images seems like a good trade-off.
The unknown is the extra decoder complexity these previews would entail.
Granted, if there's only 1 extra scan (the 'preview'), it makes things easier than having potentially arbitrary refinement layers.

@jakearchibald
Author

jakearchibald commented Apr 14, 2021

I updated the city example & threw more bits at it.

As for the chroma thing, I'd love AVIF to have a separate chroma quality for cases like this. Also, when targeting high-density screens, I feel that chroma quality could go lower and still be OK. It'd also be interesting to compare lower chroma quality vs subsampling.

@kornelski

Granted, if there's only 1 extra scan (the 'preview'), it makes things easier than having potentially arbitrary refinement layers.

Indeed. The big advantage is that decoders can simply ignore the existence of the preview and still get a valid image in the end. AFAIK with the refinement-layers approach, all the extra complexity would be required for all decoders, even decoders that don't care about progressive download.

@leo-barnes
Collaborator

@jakearchibald

In terms of the preview image, I thought the consensus was that a blurry preview would be an abuse of something intended for a thumbnail.

That seems like a perfectly valid use case to me. You could even have multiple thumbnails of various sizes with the smallest one meant to handle things like displaying a file icon in a file manager and the larger ones meant for things like this use-case.

We discussed this a bit in the last meeting. I'll try to give a longer summary later on, but the conclusion was that we think the spec now contains everything needed to solve this use case. Both separate preview images in the form of thumbnails and "true" progressive refinement in the form of multilayer streams are now supported (in the spec). The one thing missing is #104, which should give some examples of how files should be structured and which boxes are recommended/required.

It's now up to framework maintainers to actually implement the functionality. One thing that was noted was that avifenc currently does not have any functionality to add downscaled thumbnails to an AVIF file. Opening an issue for adding that would probably be a good first step to improve tooling support.

@jakearchibald
Author

Thanks! Was the idea of blurring being a post-processing filter discussed at all?

@vincentbernat

vincentbernat commented Apr 16, 2021 via email

@jakearchibald
Author

They could, but where would that be spec'd? Also, it feels like the blur amount should be metadata that sits with the thumbnail.

@leo-barnes
Collaborator

@jakearchibald

Thanks! Was the idea of blurring being a post-processing filter discussed at all?

Yep. I'm going to listen to the recording and write up the full meeting notes beginning of next week. I think the conclusion was that post-processing really depends on the exact use case and probably should not be mandated by the spec. Some use-cases (like this) may want to up-scale the thumbnail to the main image resolution and do some blurring. Other use-cases will probably want to use the thumbnail in its native resolution, in which case you probably do not want any blurring. HEIF currently doesn't contain any functionality for specifying scaling or other image processing steps apart from crop, rotate and mirror. Adding something like this would be a pretty major addition.

@wantehchang mentioned that applying blur when upscaling images in browsers seems to be standard practice. It's something that could be done for thumbnails of all image formats, which makes it feel more like an implementation detail than something that should be specified in the spec.

@jakearchibald
Author

Cheers! There might be a risk of the blur being different in different browsers, but maybe that's ok. Exciting stuff!

@leo-barnes
Collaborator

There might be a risk of the blur being different in different browsers, but maybe that's ok.

I think that's ok. Not specifying how it should be done could actually be considered a feature. I've seen a fair amount of articles on ML-based super-resolution that automatically filter out compression artifacts. Using something like that instead of a regular upscale+blur could potentially give you even better results. But trying to specify something like that in a spec would be a nightmare. 🙂

@leo-barnes
Collaborator

leo-barnes commented Apr 20, 2021

I have opened an issue for avifenc to support adding low quality thumbs here: AOMediaCodec/libavif#605

With that I think we have added everything needed to consider this issue resolved. Please feel free to re-open if there's something you think we missed.

@ewanm89

ewanm89 commented Jun 21, 2021

@jakearchibald

Doesn't the browser know the MIME type of the asset up front?

Not before making the request. It knows the type if it's specified in the response Content-Type header, however all browsers ignore this and just sniff the start of the resource instead.

Urm, no modern browser just starts sniffing the content. For a start, if one uses a Content-Type header set to application/octet-stream, the browser triggers a download to save the file to disk instead of rendering it (an old web developer trick to force text, HTML, etc. to download rather than render). Also, I know several browsers complain about the Content-Type header not being a recognised codec for <video> source tags. The <img> tag is old, so there might still be some legacy stuff in that one that I don't know about.

@jakearchibald
Author

jakearchibald commented Jun 21, 2021

@ewanm89

It knows the type if it's specified in the response Content-Type header, however all browsers ignore this and just sniff the start of the resource instead.

Urm, no modern browser just starts sniffing the content

Mate, c'mon, at least read the thing you're quoting.

@ewanm89

ewanm89 commented Jun 21, 2021

How is "all browsers ignore this and just sniff the start of the resource instead" hard to read?

@ewanm89

ewanm89 commented Jun 21, 2021

@jakearchibald

I'm going to rewrite this, to make it clearer, but the issue is not my reading here.

Modern browsers pay attention to the Content-Type header.
There are 2 clear demonstrations of this:

  1. An old web developer trick to force a browser to download rather than render something is to use Content-Type: application/octet-stream
  2. On video tags with a simple media source, if the media is not sent with a valid Content-Type for that media, several browsers throw an error (I've had to fix this one in webserver configs several times).

Now, in very early HTTP, Content-Type didn't exist and was often inferred from the tag, so there might still be some legacy behaviour on img tags. But this was before HTTP 1.0, where several features of MIME were integrated into the specification (I had to point this out to my CS professor, who claimed HTTP didn't use MIME; it uses roughly the three quarters of it that is relevant, folded into the HTTP specifications).

@ausi

ausi commented Jun 21, 2021

@ewanm89 I think the topic here is loading images in HTML pages via <img> and in that case the Content-Type does not matter (even in modern browsers).

I’m happy to discuss this further via Twitter DM if you like: @ausi
But I don't think this ticket is a good place to discuss it.
