Not bad, but here's where you're going astray.
This is another installment of me droning on and on, but if you want to really understand this stuff, it'll be worth it.
A non-quad sensor still has a Bayer filter over the photodiodes, and each captures the light intensity of a single color, R, G, or B.
Each of these locations becomes an RGB pixel after the two missing channels are reconstructed, via an interpolation algorithm, from neighboring pixels that captured them. The simplest is nearest-neighbor averaging, which usually produces a decent result, but all manner of complications in the image can defeat simple averaging and produce artifacts.
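A toy illustration of that failure mode (my own sketch, not from any real camera pipeline): averaging the two nearest captured samples lands exactly on a smooth gradient, but misses badly at a hard edge, which is exactly the kind of complication that produces artifacts.

```python
import numpy as np

# True green values along one row of the scene; pretend the sensor only
# captured the even positions (the odd ones sit under other color filters).
ramp = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])   # smooth gradient
edge = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # hard edge

def reconstruct_odd(row):
    """Estimate positions 1 and 3 by averaging their two even-position neighbors."""
    return (row[0:-2:2] + row[2::2]) / 2

print(reconstruct_odd(ramp))  # estimates 0.1 and 0.3: exactly right
print(reconstruct_odd(edge))  # estimates 0.0 and 0.5: position 3 is really 1.0
```

On the ramp the average is exact; on the edge it lands halfway between the two sides, smearing the transition. Real images are full of edges, which is why simple averaging has a ceiling.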
This is called demosaicing or debayering, and is a part of capturing images from all digital cameras (with some esoteric exceptions we'll ignore, as they're irrelevant to this discussion).
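For the curious, here's a minimal sketch of what nearest-neighbor-averaging demosaicing looks like. This assumes an RGGB pattern and a NumPy array holding one captured value per photodiode; real pipelines are far more involved than this.

```python
import numpy as np

def bayer_masks(h, w):
    """Boolean masks for an RGGB pattern: R at (0,0), G at (0,1)/(1,0), B at (1,1)."""
    r = np.zeros((h, w), bool); r[0::2, 0::2] = True
    b = np.zeros((h, w), bool); b[1::2, 1::2] = True
    g = ~(r | b)
    return r, g, b

def demosaic_nn_average(mosaic):
    """Fill each missing channel with the average of captured samples in a 3x3 window."""
    h, w = mosaic.shape
    out = np.zeros((h, w, 3))
    for ch, mask in zip(range(3), bayer_masks(h, w)):
        vals = np.where(mask, mosaic, 0.0)
        ksum = np.zeros((h, w))  # sum of captured samples in the window
        kcnt = np.zeros((h, w))  # count of captured samples in the window
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ksum += np.roll(np.roll(vals, dy, 0), dx, 1)
                kcnt += np.roll(np.roll(mask.astype(float), dy, 0), dx, 1)
        out[..., ch] = ksum / np.maximum(kcnt, 1)  # np.roll wraps at edges; fine for a sketch
    return out
```

Note that at a photodiode's own color, the window average returns the captured value untouched; only the two missing channels are truly estimates.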
There are more sophisticated reconstruction algorithms that do a better job, some of the best being "content-aware": they analyze the content of the image and adjust how the two missing channels are determined at each pixel.
I explain all this to make the following point: the reconstructed channels have an error range (error bars) that can be mathematically calculated. A non-quad image has errors at every pixel for the two reconstructed channels, just like a quad-Bayer capture does. The quad-Bayer errors are just larger (for the same demosaicing algorithm, which is the critically important caveat) than the errors from a simple Bayer filter.
So you can see where the problem is with the idea of a "true" 48MP image... what does that mean? Error-free isn't possible. Is there a particular error threshold you have in mind? I'm sure you see the problem.
The idea of a "true 48MP image" gets even more meaningless if you allow for different demosaicing algorithms. Suppose you apply a very sophisticated, compute-intensive demosaicing algorithm to the 48MP quad-Bayer capture, and the simplest nearest-neighbor algorithm to a capture from a theoretical camera whose only difference is an ordinary Bayer filter instead of a quad.
With that far more sophisticated demosaicing algorithm, the error bars for the missing channels can be smaller for the quad image than for the non-quad, resulting in higher resolution and higher color fidelity than the non-quad capture.
Which one is the "true" 48MP image?
When Sony introduced the quad-Bayer in 2018, demosaicing algorithms were all designed for a 2x2 Bayer pattern, so they didn't do the greatest job minimizing errors. Hence the reputation quad-Bayer sensors acquired, and deservedly so.
Fast forward to 2023. A lot of R&D has improved quad-Bayer demosaicing, a lot. Computing power has advanced a great deal too, making it possible to do much more than simple averaging in real time on GPUs, and even, to some extent, on-chip in some higher-end sensors. As we see with the new A3, quad-Bayer captures are getting nearly as good as simple Bayer captures, and even better when they can be demosaiced by a sophisticated algorithm.
If we define "true" to mean a pixel where all three color channels are captured directly, with no reconstruction, the closest would be a 48MP sensor with a regular Bayer filter capturing at 12MP: each 2x2 cluster of photodiodes represents a single pixel, but with the RGGB Bayer pattern over it. Red, green, and blue are then directly captured for that "pixel" and there is no demosaicing. 48MP captures would still require demosaicing, but with the more typical error size for the reconstructed channels.
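That binned capture is simple enough to sketch (my own illustration, assuming R at the top-left of each 2x2 RGGB cluster): the two green samples get averaged, and each cluster directly yields one RGB pixel with no reconstruction at all.

```python
import numpy as np

def bin_rggb_to_rgb(mosaic):
    """Bin an RGGB Bayer mosaic 2x2 into direct RGB pixels, no demosaicing.
    Assumes R at (0,0), G at (0,1) and (1,0), B at (1,1) of each cluster."""
    r  = mosaic[0::2, 0::2]
    g1 = mosaic[0::2, 1::2]
    g2 = mosaic[1::2, 0::2]
    b  = mosaic[1::2, 1::2]
    # Average the two green samples; stack into an (H/2, W/2, 3) image.
    return np.dstack([r, (g1 + g2) / 2, b])
```

Every output channel comes straight from a photodiode under the right filter, which is why this is the closest thing to a "true" capture, at a quarter of the pixel count.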
Why use a quad-Bayer filter in the first place, then? Low-light performance. Sensor manufacturers have determined that the low-light gain outweighs the cost of larger channel-reconstruction errors, because reconstruction errors can be addressed computationally, while data that simply isn't there in low light, due to the sensor's limited sensitivity and dynamic range, cannot.
The other reason, sadly, is resolution wars.