I've been thinking about compression for some time.
A BMP file, in general, consists of a 24-bit string for each pixel.
In a picture that uses only 10 colours, this results in repetition of just 10 variations of 24-bit strings.
I believe that, in general, any file can be compressed systematically to its 'best', without 'compression levels'.
There are 3 general ways to detect compressible data in a file:
1) Set frames
2) In a row
3) Throughout the file
Set frames means taking the BMP with its 24-bit variations and, since only 10 variations are used, mapping each 24-bit sequence to a 4-bit index (4 bits are enough to fit all 10 variations), so each pixel is stored in 4 bits instead of 24.
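A minimal sketch of that idea in Python (my own illustration; the function name and framing are assumptions, not from the writeup):

```python
# Sketch of MODE 1 ("set frames"): map each distinct 24-bit pixel value
# to a fixed-width index into a palette.
import math

def set_frames(pixels):
    """pixels: list of 24-bit ints. Returns (palette, bits_per_index, indices)."""
    palette = sorted(set(pixels))                      # the distinct 24-bit variations
    bits = max(1, math.ceil(math.log2(len(palette))))  # e.g. 10 colours -> 4 bits
    index_of = {p: i for i, p in enumerate(palette)}
    return palette, bits, [index_of[p] for p in pixels]

palette, bits, indices = set_frames([0xFF0000, 0x00FF00, 0xFF0000, 0x0000FF])
print(len(palette), bits, indices)   # 3 distinct colours -> 2-bit indices: 3 2 [2, 1, 2, 0]
```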
In a row means looking for repetition of the same sequence, such as one colour repeating for a chunk of the file, storing its sequence, offset, and repetition count, and removing the run from the file completely (it can later be restored at that offset).
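Here is a hedged sketch of that at byte level (the record layout is my assumption; the writeup works at bit level):

```python
# Sketch of MODE 2 ("in a row"): find runs of a repeated byte, record
# (offset, value, repetitions), and strip the run from the stream.
def in_a_row(data, min_run=4):
    kept, records = bytearray(), []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        if j - i >= min_run:
            records.append((i, data[i], j - i))  # offset, sequence, repetitions
        else:
            kept.extend(data[i:j])
        i = j
    return bytes(kept), records

kept, records = in_a_row(b"\x00" * 12 + b"ab")
print(records, kept)   # [(0, 0, 12)] b'ab'
```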
Throughout the file is like a general Huffman-style system: it specifically targets repeated sequences anywhere in the file and rewrites them using a code tree of variable-length codes. The idea is that the more frequent sequences use fewer bits than the less frequent ones, which should allow overall compression (and the codes are prefix-free, so they cannot be confused when reading the file from left to right).
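For reference, a classic frequency-based Huffman code in Python (the writeup's MODE 3 ranks by bits saved instead, as described further down; this standard version just shows the prefix-free mechanism):

```python
# Standard Huffman coding: frequent symbols get shorter codes, and no code
# is a prefix of another, so the stream decodes unambiguously left to right.
import heapq, itertools
from collections import Counter

def huffman_codes(data):
    tie = itertools.count()   # unique tiebreaker so the heap never compares dicts
    heap = [(n, next(tie), {sym: ""}) for sym, n in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]

print(huffman_codes(b"aaaabbc"))   # {99: '00', 98: '01', 97: '1'}: 'a' is most frequent
```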
While each way of detecting compression allows for compression, each will generally find compression that the others will not.
Set frames can reduce the file as a whole.
In a row can capture large chunks in one section (for example, it is essentially the optimal method for a 1GB file of 0 bits).
Throughout finds general repetition of specific sequences and recodes them.
Throughout (like Huffman), when done alone, can leave runs of coded sequences wherever the repetition in the source was itself in a row (as in a 1GB file of 0 bits).
To fix that, an 'in a row' scan done in advance prevents this where possible, leaving the throughout scan to find repetition between the gaps rather than writing out a file that inherits runs from its source.
By shrinking frames into fewer bits based on the sequences actually used, the in-a-row scan gets to store shorter offsets, sequences, and repetition data in general.
Also, the throughout pass (like Huffman) is then done on a shrunk file of generally organized, common data, with the in-a-row sequences already removed.
The overall effect of running 'set frames', then 'in a row', then 'throughout' is that each pass inherits from the previous scans, producing a generally compressed file.
The idea is that each of these methods, called 'MODES', will be thorough in its approach.
Set frames will not simply look at all sequences in the file as a whole.
It will specifically target areas of the file that can be shrunk, identified by their offset locations - a dynamic method of shrinking areas into fewer bits according to the sequence variations they contain.
For example, a file can have a prevalence of five 24-bit sequences in multiple areas such as 0x20-0x100, 0x200-0x500, 0x1000-0x2500, etc. It will shrink those specific areas and associate them with the same 'section', dynamically shrinking areas in whatever way produces the most overall compression.
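A rough illustration of that area detection (the window size and 'few variations' threshold are my own assumptions; the real method is meant to be dynamic):

```python
# Sketch: slide a window over the values and flag stretches with few
# distinct variations as shrinkable areas, merging adjacent windows.
def find_shrinkable_areas(values, window=256, max_variations=16):
    areas, start = [], None
    for i in range(0, len(values), window):
        if len(set(values[i:i + window])) <= max_variations:
            start = i if start is None else start   # open or extend an area
        elif start is not None:
            areas.append((start, i))                # close the current area
            start = None
    if start is not None:
        areas.append((start, len(values)))
    return areas   # ranges analogous to 0x20-0x100, 0x200-0x500, ...
```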
For in a row, it will only record runs long enough to complement the throughout pass that follows. Rather than treating, say, 48 zero bits in a row as automatically eligible for storage as a run, it will look for the shortest version of a repetitive sequence in a row that is actually 'worth recording'.
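'Worth recording' comes down to a break-even test: the run must save more bits than its record costs. A worked example, assuming a record stores a 32-bit offset, a 32-bit count, and the sequence itself (the field sizes are illustrative):

```python
# Break-even test: bits removed from the file must exceed the record's cost.
def worth_recording(seq_bits, repetitions, offset_bits=32, count_bits=32):
    saved = seq_bits * repetitions                 # bits the run occupies in the file
    cost = offset_bits + count_bits + seq_bits     # bits needed to store the record
    return saved > cost

print(worth_recording(seq_bits=1, repetitions=48))    # False: 48 zero bits don't pay off
print(worth_recording(seq_bits=1, repetitions=8000))  # True: a long run pays for its record
```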
For throughout, it will determine the sequences that save the most bits in the file overall, blank out their offsets, and rescan in this manner so as to end up with the fewest, most bit-saving sequences needed to assemble the file. The code tree is constructed not by raw frequency but by ordering the most bit-saving sequences to use the fewest bits on the tree.
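A sketch of that ranking step (the placeholder code length is an assumption; a real implementation would assign actual prefix codes after ranking):

```python
# Rank candidate sequences by total bits saved rather than raw frequency.
def rank_by_savings(candidates, code_bits=8):
    """candidates: list of (sequence_bits, occurrences). Highest saving first."""
    return sorted(candidates,
                  key=lambda c: (c[0] - code_bits) * c[1],
                  reverse=True)

# A 24-bit sequence seen 1000 times outranks a 64-bit one seen 100 times:
print(rank_by_savings([(64, 100), (24, 1000)]))   # [(24, 1000), (64, 100)]
```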
As each method is considered optimized, there are no 'compression levels'.
It is 3 optimized algorithms for detecting compression which, when run one after the other in priority order, compress the file overall, with each pass inheriting from the last.
Since each of the 3 MODES looks for repetition and compression in a different way, this applies generally to any file in which repetition can be found.
For each section found by MODE 1 - such as the 24-bit strings of 5 variations spread across multiple areas - all of those areas are treated as the same section, and the 'in a row' and 'throughout' MODES that follow are run on each section as though it were one continuous string, adjusting offset information where required.
The result is that 'in a row' and 'throughout' operate on shrunk sections of specific, common data, which encourages more data to be found 'in a row' and lets throughout work on common data.
This will compress more than running 'in a row' or 'throughout' on the whole file (even one already shrunk into sections), since 'in a row' specifically targets common data detected in advance and treated as continuous, and 'throughout' looks for repetition in an area already known to have few sequences and a high prevalence of them.
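A hedged sketch of treating scattered areas as one continuous string (the boundary bookkeeping is my own illustration):

```python
# Concatenate a section's areas into one stream and remember the
# boundaries so offsets can be mapped back to the original file.
def join_section(data, areas):
    """areas: list of (start, end) byte ranges belonging to one section."""
    joined, boundaries, pos = bytearray(), [], 0
    for start, end in areas:
        joined.extend(data[start:end])
        boundaries.append((pos, start))   # (offset in joined, offset in file)
        pos += end - start
    return bytes(joined), boundaries

joined, bounds = join_section(b"AAxxBBxxCC", [(0, 2), (4, 6), (8, 10)])
print(joined, bounds)   # b'AABBCC' [(0, 0), (2, 4), (4, 8)]
```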
When each MODE is done once, one after the other (and within each respective section), the result is something generally streamable, in the sense of decoding on the fly.
When each MODE is repeated more than once where possible, the result is like a Russian doll, with a doll inside a doll: the file is the smallest doll plus the data needed to rebuild all the larger ones.
This can, for example, find more data 'in a row' on a second pass, based on what remains after the initial in-a-row data is removed - squeezing the file like an accordion to get more compression where possible from the 3 optimized (and complementary) MODES.
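The repeated 'archival' scheme might look like this (run_modes() stands in for the MODE 1 -> MODE 2 -> MODE 3 pipeline; the interface is an assumption):

```python
# Sketch of the "russian doll" archival scheme: keep re-running the pass
# pipeline while it still shrinks the file, keeping each layer's metadata.
def archival_compress(data, run_modes):
    layers = []
    while True:
        smaller, metadata = run_modes(data)   # one full pass of all applicable MODES
        if len(smaller) >= len(data):         # no further gain: stop
            return data, layers
        layers.append(metadata)               # the data to rebuild the next larger "doll"
        data = smaller
```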
I have a writeup (which needs some adjusting - my compression predictions in it are generally too high), but the general pseudocode is there, along with the expectation of an overall highly compressed file (without compression levels), in both a highly compressed 'streamable' form and a highly compressed 'archival' form.
As there are no compression levels, and each MODE outputs the 'ideal' result - the fewest-bit version possible - it can systematically compress files wherever possible, by multiple methods of detecting compression.
Additional information can be put in the header, such as resolution or bitrate, so that the data being worked on can be raw data (raw image or raw audio) for maximum overall compression results. Minor synchronization information can also be included to allow immediate decoding from specific offsets (files) when using the streamable scheme.
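Purely as an illustration, such a header might be packed like this (every field name and size here is an assumption, not a specified format):

```python
# Hypothetical header: scheme id, raw-data hints, and sync points mapping
# compressed offsets to original offsets for the streamable scheme.
import struct

def pack_header(scheme, width, height, bitrate, sync_points):
    head = struct.pack("<BIIIH", scheme, width, height, bitrate, len(sync_points))
    for compressed_off, original_off in sync_points:
        head += struct.pack("<QQ", compressed_off, original_off)
    return head

print(pack_header(1, 1920, 1080, 0, [(0, 0)]).hex())
```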
The concept is a systematic manner of compressing a file using multiple complementing and inheriting methods of detecting compression, and only using a MODE where it is detected in advance that compression will occur - scans for MODE 1 and MODE 2 are done first, and if only MODE 3 will find compression, that is the only one used.
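That selection step might be sketched as follows (predict_size() is a hypothetical dry-run interface I'm assuming for illustration):

```python
# Trial-scan each MODE and keep only those predicted to shrink the file.
def select_modes(data, modes):
    chosen = []
    for mode in modes:                            # e.g. [set_frames, in_a_row, throughout]
        if mode.predict_size(data) < len(data):   # dry-run scan, no rewrite yet
            chosen.append(mode)
    return chosen                                 # may end up as MODE 3 alone, as above
```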
For this reason it is calculation-intensive and requires generous temporary storage - the result is a generally optimized file which can usually be accessed immediately.
While the writeup could use some adjustment (along with the pseudocode and an accurate, systematic representation of the header and data, for example), I'll put the writeup here.
I have also made a post here:
http://forum.codecall.net/topic/779...assist-in-developing-a-compression-algorithm/
If anyone is interested in taking this concept, coding the algorithms for each MODE, and letting them work together systematically based on the selected SCHEME, this could generally be used as a compression tool, with an expectation of high results.
For the archival scheme, with its repeated passes, I don't think compression tools like rar/7z go as far as this; and a general streaming file is expected to be both highly compressed (since each MODE is optimized for ideal results) and accessible.
Any other MODES which complement and inherit are also welcome.
My programming skills are low, but a general outline and pseudocode of the expected results are there.
Thanks.