# Flow of TAR stream

## `./archive/tar`

The import path `github.com/vbatts/tar-split/archive/tar` is a fork of the upstream Go standard library package [`archive/tar`](http://golang.org/pkg/archive/tar/).
It adds plumbing to access the raw bytes of the tar stream as the headers and payload are read.

## Packer interface

For ease of storage and use of the raw bytes, there will be a storage
interface that accepts an `io.Writer`. This way you could pass it an in-memory
buffer or a file handle.

Having a Packer interface allows configuring the `hash.Hash` used for file
payloads and providing your own `io.Writer`.

Instead of having a state directory to store the header information for all
Readers, we will leave that up to the user of the Reader, because we cannot
assume an ID for each Reader with which to keep its information distinct.

## State Directory

Perhaps we could deduplicate the header info by hashing the raw bytes and
storing them in a directory tree like:

	./ac/dc/beef

Then reference the hash of the header info in the positional records for the
tar stream. This could be a future feature, though, and is not required for an
initial implementation. It would also imply an owned state directory, rather
than just writing storage info to an `io.Writer`.

## Concept Example

First we'll get an archive to work with. For repeatability, we'll make an
archive from what you've just cloned:

```
git archive --format=tar -o tar-split.tar HEAD .
```

Then build the example main.go:

```
go build ./main.go
```

Now run the example over the archive:

```
$ ./main tar-split.tar
2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
pax_global_header pre: 512 read: 52
.travis.yml pre: 972 read: 374
DESIGN.md pre: 650 read: 1131
LICENSE pre: 917 read: 1075
README.md pre: 973 read: 4289
archive/ pre: 831 read: 0
archive/tar/ pre: 512 read: 0
archive/tar/common.go pre: 512 read: 7790
[...]
tar/storage/entry_test.go pre: 667 read: 1137
tar/storage/getter.go pre: 911 read: 2741
tar/storage/getter_test.go pre: 843 read: 1491
tar/storage/packer.go pre: 557 read: 3141
tar/storage/packer_test.go pre: 955 read: 3096
EOF padding: 1512
Remainder: 512
Size: 215040; Sum: 215040
```

*What are we seeing here?*

* `pre` is the header of a file entry, and potentially the padding from the
  end of the prior file's payload. With particular tar extensions and pax
  attributes, the header can also exceed 512 bytes.
* `read` is the size of the file payload from the entry.
* `EOF padding` is the expected 1024 null bytes at the end of a tar archive,
  plus potential padding from the end of the prior file entry's payload.
* `Remainder` is the remaining bytes of an archive. This is typically dead space,
  as most tar implementations return after reaching the end of the
  1024 null bytes. However, various implementations include some number of
  bytes here, which affects the checksum of the resulting tar archive,
  so this must be accounted for as well.

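The figures in the example output can be checked by hand. Tar payloads are padded to a multiple of 512 bytes, and that padding is counted in the next entry's `pre` (or in `EOF padding` for the last entry). The helper `padding` below is a hypothetical name for this sketch, and the first check assumes LICENSE has a plain 512-byte header with no extensions.

```go
package main

import "fmt"

const blockSize = 512 // tar block size in bytes

// padding returns the number of null bytes needed to round a payload of
// n bytes up to the next 512-byte block boundary.
func padding(n int64) int64 {
	return (blockSize - n%blockSize) % blockSize
}

func main() {
	// DESIGN.md has a 1131-byte payload; the next entry, LICENSE, reports pre: 917.
	// That is a 512-byte header plus 405 bytes of padding after DESIGN.md.
	fmt.Println(blockSize+padding(1131) == 917)

	// packer_test.go has a 3096-byte payload; EOF padding is reported as 1512.
	// That is the 1024 trailing null bytes plus 488 bytes of padding.
	fmt.Println(1024+padding(3096) == 1512)
}
```

Both comparisons print `true`, which is consistent with the description of `pre` and `EOF padding` given in the list.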
Ideally the input tar and output `*.out` will match:

```
$ sha1sum tar-split.tar*
ca9e19966b892d9ad5960414abac01ef585a1e22  tar-split.tar
ca9e19966b892d9ad5960414abac01ef585a1e22  tar-split.tar.out
```
