It sounds like you are confusing/combining different methods of achieving 4K video in your description.
4K is about 8 MP (not MB), that is true. The part I think you're missing is that it doesn't matter if the sensor is 1,000,000 megapixels and 2 feet in diameter: if the camera only uses a 4K portion of the sensor, only that portion is active and it is recording native 4K with no downsampling at all. You aren't throwing away any data to achieve 4K because no additional data is being recorded in the first place. The catch is that this method crops the image relative to the full sensor and its native focal length (hence why HQ mode on the M2P drops from a 77-degree to a 55-degree FOV). That is completely different from using the entire sensor and then pixel binning or line skipping to get down to 4K, which is closer to what you are describing: the whole sensor is active, but data is thrown away to reduce both resolution and data throughput. Pixel binning combines or averages the values of adjacent pixels to reduce effective resolution, and line skipping is exactly what it sounds like - certain lines on the sensor simply aren't read. Both of those methods let you keep the native FOV, but at a cost in quality.
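If it helps, here's a rough NumPy sketch of the difference between those three readout strategies. The sensor dimensions and the 2x2 binning factor are just placeholders for illustration, not the M2P's actual readout pipeline:

```python
import numpy as np

# Placeholder full-sensor readout (~20 MP, 3:2 aspect), just for illustration.
sensor = np.random.rand(3648, 5472)

UHD_H, UHD_W = 2160, 3840  # 4K UHD frame size

# 1) Cropped readout (HQ-style): read only a 4K window of the sensor.
#    Native 4K with no downsampling, but the FOV narrows.
top = (sensor.shape[0] - UHD_H) // 2
left = (sensor.shape[1] - UHD_W) // 2
cropped = sensor[top:top + UHD_H, left:left + UHD_W]

# 2) Pixel binning: average adjacent 2x2 blocks to halve the resolution
#    in each dimension. Full FOV, but reduced detail.
binned = sensor.reshape(1824, 2, 2736, 2).mean(axis=(1, 3))

# 3) Line skipping: simply don't read every other row (and column here).
#    Full FOV, but aliasing/moire because data is discarded outright.
skipped = sensor[::2, ::2]

# A real camera would follow the binning/skipping with a small rescale
# to land exactly on 3840x2160.
print(cropped.shape, binned.shape, skipped.shape)
```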
In FOV mode, what DJI does is something called subsampling, which is basically pixel skipping. It allows DJI to maintain the native FOV (28mm equivalent / 77 degrees) while reducing the output resolution to 4K. Pixels are skipped when the sensor is read out, and that data is then processed into what you see as a 4K image. The goal with all of these "workarounds" is to reduce the amount of data that has to be processed. DJI claims this method yields a higher-quality image than line skipping would have and was the best 'compromise'.
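DJI hasn't published exactly how its subsampling works, but the general idea of skipping pixels across the whole sensor so the frame still spans the native FOV looks roughly like this (again, placeholder numbers, not DJI's actual pipeline):

```python
import numpy as np

# Placeholder full-sensor readout, just for illustration.
sensor = np.random.rand(3648, 5472)
UHD_H, UHD_W = 2160, 3840

# A 16:9 video frame from a 3:2 sensor still uses the full sensor width;
# take the central 16:9 strip first, then skip pixels within it.
h_169 = sensor.shape[1] * 9 // 16            # 3078 rows
top = (sensor.shape[0] - h_169) // 2
strip = sensor[top:top + h_169, :]

# Pick ~2160 evenly spaced rows and ~3840 evenly spaced columns from the
# strip: the frame still covers the full sensor width (native 77-degree
# FOV), but most pixels are never read or used.
rows = np.linspace(0, strip.shape[0] - 1, UHD_H).astype(int)
cols = np.linspace(0, strip.shape[1] - 1, UHD_W).astype(int)
subsampled_4k = strip[np.ix_(rows, cols)]

print(subsampled_4k.shape)  # (2160, 3840) - full FOV, reduced detail
```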
The better way to do 4K, and what the
M2P apparently does not have enough processing power to do, is a full-width sensor readout followed by properly scaling the image down to 4K using all of the sensor data (which I think is also along the lines of what you're trying to describe). Yes, the resolution changes, but you are using the full sensor data to build each 4K frame. Properly processed, this yields much better quality video with reduced moire, artifacts, etc. However, it requires a very fast sensor that can offload a very large amount of data very quickly, especially at 60 fps, and then you need the processing power to deal with all of that data. You also maintain the native focal length when you do this, which is especially useful for interchangeable-lens camera systems. The newer Sony RX100 cameras can do this because they use Sony's latest stacked BSI 1" sensors with integrated DRAM, which have crazy fast readouts. Theoretically, DJI could have purchased those sensors and put them in the M2P, but they would still have to process all of that data, which is where they claim the bottleneck is.
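For comparison, a full-readout-plus-downscale pipeline looks more like the sketch below. The area-average resize is just a stand-in for whatever scaler a real camera ISP actually uses, and the sensor numbers are placeholders:

```python
import numpy as np
import cv2

# Placeholder full-sensor readout (~20 MP, 3:2 aspect), just for illustration.
sensor = np.random.rand(3648, 5472).astype(np.float32)
UHD_W, UHD_H = 3840, 2160

# A 16:9 video frame from a 3:2 sensor uses the full width but only a
# 16:9 strip of the height; that strip is still roughly 17 MP per frame.
h_169 = sensor.shape[1] * 9 // 16            # 3078 rows
top = (sensor.shape[0] - h_169) // 2
strip = sensor[top:top + h_169, :]

# Every pixel in the strip contributes to the 4K output, which is why
# moire and artifacts drop compared with skipping or coarse binning.
# The cost: the sensor has to dump ~17 MP per frame (x60 for 60 fps),
# and the image processor has to crunch all of it in real time.
full_readout_4k = cv2.resize(strip, (UHD_W, UHD_H), interpolation=cv2.INTER_AREA)

print(full_readout_4k.shape)  # (2160, 3840)
```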
So you aren't really wrong; I just think you were mixing up different ways of obtaining 4K, which is admittedly not straightforward.