Delta Compression | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
||||||||||||
Download: |
|
|||||||||||
Minimal Data Transfer |
CBlockDelta Remote File Get with minimal data transfer. This class is for Client/Server File transfer across slow mediums. It is intended for use in Backup/Archiving/Versioning Applications. Any computer wanting a file from another computer simply calls CBlockDelta::Get(...) with appropriate arguments. The minimal data transfer is a result of Delta Compression: If the computer asking for the file already has a similar file (an older version), the data transmission simply describes the changes needed to convert the old file into the new file. In addition, the Data Blocks transmitted are LZW compressed if the system thinks its worth doing. The Get and Put routines use CArchives as communication links; you would normally use CSocketFile based CArchives.
The Delta-Compression Process
Normally when comparing two files you compare the first Byte, then the next, and so on, but over a slow medium that would mean copying the whole of the Remote file over anyway, for the comparison, so you need a way of comparing blocks of Bytes at a time. Comparing blocks can be done with checksums.... MD5 would compare two blocks very accurately, but its quite slow. CBlockDelta uses 512 Byte locks simply because it was found that that worked best and varying the block size makes the code ugly. The spark that makes delta-compression fire is the "Rolling Checksum" which allows you to get the checksum of the first 512 Bytes (all fine and normal so far) but then you can get the checksum of the 512 Bytes starting from Byte 1 simply by removing the first Byte (Byte[0]) and adding the next Byte (Byte[512]) with a little bit of maths (rather than re-visiting all the Bytes in the middle). This ability to "roll" the checksum forward one Byte allows us to efficiently scan a file for a block that has a particular checksum, and find that block no matter what its displacement is.
Walkthrough:
One computer wants the latest version of a file that it already has an old version of. So it creates an array of checksums, two entries for each 512 Byte block of the file that it has. One array entry uses the Rolling Checksum algorithm on the block (because its fast). The other array entry contains the MD5 checksum for the block (because its accurate). It sends this array of Checksums to a computer which has the more recent version of the file. This computer opens the more recent version and looks at the first 512 Bytes, forming the Rolling Checksum and sees if that checksum exists in the Checksum array. If it doesn't then it rolls one Byte forward in its file, by removing the first Byte from the rolling checksum, and adding the 513th Byte to the Rolling checksum. It looks again for the new value in the Checksum Array. Assume that it finds it, this time. Since the Rolling Checksum isn't very accurate (different blocks may give the same value), the MD5 checksum is calculated and checked with the appropriate entry in the array (which is now a Hash Table of Rolling Checksums, each with an array of matching MD5 Checksums, for speed). If an entry can be found with the same MD5 value, it is likely that it is the same block (true, it may not be, but the chances are microscopic - the chances of two blocks having the same MD5 is small, but having the same MD5 AND the same rolling checksum? So that means that there is one new byte at the beginning of the file and a block that is already known. CBlockDelta forms a CString which describes the differences (called Comparison) which is just a text file one line describes either a start and end index to data, or a block number or range of blocks. So if the first Byte was all that was added, and there were 1000 blocks of data afterwards, the Comparison string would simply say:1,1 0-999MakeDelta:
- All numbers in the string are in Hexadecimal.
- The string contains lines of text separated with '\n' characters.
- A single number on a line indicates a Block from the older File.
- A Block is 512 Bytes except for the last Block which may be shorter (FileSums::GetEndBlockLength()).
- A pair of numbers separated with a '-' indicate a range of adjacent Blocks from the older File ([First Block]-[Last Block] inclusive).
- A pair of numbers separated with a ',' indicate a range of adjacent Bytes from the newer File ([File Position],[Bytes to read]).
- If the Files are equal you get a single line string: 0-FileSums::GetBlockCount() indicating that all the Blocks of the older File make the newer file.
- If the Files are totally different you get a single line string: 0,FileSums::FileLength indicating that all the Bytes of the newer File make the newer file.
The computer with the latest version of the file then uses the Comparison String to return a binary stream of data that contains blocks of compressed data that the other computer doesn't have, and block numbers that it does have.
UseDelta:
The computer with the old version of the file creates a new file using the blocks from either the data stream, or the old file according to the instructions in the data stream.
Example Usage:
Derive a class from CBlockDelta that will create CSocketFile based CArchives.
The following example is of a Dialog Box with a CBlockDelta class within it.
Create a dialog application with a List Box (IDC_LIST1) and make the dialog class include the following code:
#include <afxsock.h> // MFC socket extensions #include "BlockDelta.h" class CBlockDeltaTesterDlg : public CDialog { public: CBlockDeltaTesterDlg(CWnd* pParent = NULL); // standard constructor void Say(CString Msg) {((CListBox*)GetDlgItem(IDC_LIST1))->AddString(Msg);} class CRemoteFileFetcher : public CBlockDelta { CBlockDeltaTesterDlg* Dlg; static DWORD FAR PASCAL ThreadProc(LPVOID lpData) {((CRemoteFileFetcher*)lpData)->Listen(); return 0;} HANDLE hThread; DWORD ThreadID; public: CRemoteFileFetcher(CBlockDeltaTesterDlg* Dlg) : Dlg(Dlg) {hThread=CreateThread((LPSECURITY_ATTRIBUTES)NULL, 0, (LPTHREAD_START_ROUTINE)ThreadProc, (LPVOID)this, 0, &ThreadID);} void Say(CString Msg) {Dlg->Say(Msg);} virtual void Getting (CString Remote) {Say("<Getting "+Remote);} // Called when starting to Get a file virtual void Putting (CString Local ) {Say(">Putting "+Local);} // Called when starting to Put a file virtual void SetBytesToRead (DWORD Bytes) {CString S; S.Format("<%u Bytes to read" ,Bytes); Say(S);} // Called as soon as BytesToRead is known virtual void SetBytesToWrite(DWORD Bytes) {CString S; S.Format(">%u Bytes to write",Bytes); Say(S);} // Called as soon as BytesToWrite is known virtual void SetInProgress (DWORD Bytes) {CString S; S.Format("<%u Bytes in" ,Bytes); Say(S);} // Called whenever Bytes are received virtual void SetOutProgress (DWORD Bytes) {CString S; S.Format(">%u Bytes out" ,Bytes); Say(S);} // Called whenever Bytes are sent virtual void Received (char Compression) {CString S; S.Format("<%u%% compressed" ,Compression); Say(S);} // Called when finished Putting (Compression is %Compressed) virtual void Sent (char Compression) {CString S; S.Format(">%u%% compressed" ,Compression); Say(S);} // Called when finished Getting (Compression is %Compressed) void Listen() { AfxSocketInit(); CSocket ServerSocket; ServerSocket.Create(14); ServerSocket.Listen(); for(;;) { CSocket Socket; ServerSocket.Accept(Socket); CSocketFile File(&Socket); CArchive ArI(&File, CArchive::load); CArchive ArO(&File, CArchive::store); Put(ArI,ArO); } } // Returns an error message or "" if everything worked. CString Fetch(CString Remote, CString Local, CString Output, CString ServerIP) { AfxSocketInit(); CSocket Socket; if(!Socket.Create() || !Socket.Connect(ServerIP, 14)) return "Socket creation failed"; CSocketFile SocketFile(&Socket); CArchive ArI(&SocketFile, CArchive::load); CArchive ArO(&SocketFile, CArchive::store); return Get(Remote,Local,Output, ArI,ArO); } }; ... The rest of the dialog class };Make the OnInitDialog() look like this:
BOOL CBlockDeltaTesterDlg::OnInitDialog() { CDialog::OnInitDialog(); CRemoteFileFetcher RemoteFileFetcher(this); CString S; S=RemoteFileFetcher.Fetch("Remote.txt","Local.txt","Output.txt", "127.0.0.1"); if(!S.IsEmpty()) AfxMessageBox(S); // could just return S if this is inside a Function returning a CString. return TRUE; // return TRUE unless you set the focus to a control }CRemoteFileFetcher creates a listening Thread as soon as it is instantiated. So if you run it on two computers, either computer may use the Fetch(...) function and the other will respond. For a quick test on a single computer you just use the IP address "127.0.0.1", otherwise use two computers. In the project's folder (the usual startup directory in Debug mode) copy the normal ReadMe.txt file to a file called Local.txt then copy Local.txt to a file called Remote.txt. Edit Local.txt with Notepad: select all the text and copy it, then paste it six times in succession. Now you have two similar files and this dialog will report what is happening when they are compared. If you were doing this on two computers, Remote.txt would be on the Server and Local.txt on the Client. The line in the OnInitDialog(): S=RemoteFileFetcher.Fetch("Remote.txt","Local.txt","Output.txt", "127.0.0.1"); tells the class to Fetch Remote.txt using Local.txt (if it can) to create Output.txt on the Local(Client) computer. The block size if fixed throughout the class as 512 Bytes (found by experiment to be optimum). The ReadMe.txt file was split into 7 blocks and one 175 Byte block which is sent compressed to 109 Bytes. Since the Blocks are only sent as numbers the zipped block was the only File data sent through the Sockets. Taking all socket data into account, the file was effectively sent through the Sockets 89% compressed and 96% compressed if the Remote.txt is the one with six pastes.
CBlockDelta requires the following additional files:
- LZW.h To compress blocks for transmission
- MD5.h For strong checksum
- Roller32.h For CRoller checksum
- DTree.h For CFO (File Object) data structure
- TreeList.h Used by DTree.h
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.