# Flow of TAR stream

## `./archive/tar`

The import path `github.com/vbatts/tar-split/archive/tar` is a fork of the upstream Go standard library package [`archive/tar`](http://golang.org/pkg/archive/tar/).
It adds plumbing to access the raw bytes of the tar stream as the headers and payload are read.

## Packer interface

For ease of storage and usage of the raw bytes, there will be a storage
interface that accepts an `io.Writer`. This way you could pass it an in-memory
buffer or a file handle.

Having a Packer interface allows configuring the `hash.Hash` used for file
payloads and providing your own `io.Writer`.

Instead of having a state directory to store all the header information for all
Readers, we will leave that up to the user of the Reader, because we cannot
assume an ID for each Reader with which to keep that information differentiated.

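To make the idea concrete, here is a minimal sketch of what such a storage interface could look like. Everything in it is an assumption made for illustration: the `Entry` fields, the `Packer` method set, the JSON-lines encoding, and the helper names `payloadEntry` and `packDemo` are hypothetical and are not taken from this repository's code.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"hash"
	"io"
)

// Entry is a hypothetical record for one segment of the tar stream: either
// raw bytes (headers, padding) or a reference to a file payload.
type Entry struct {
	Position int    `json:"position"`
	Name     string `json:"name,omitempty"`     // set for file payloads
	Raw      []byte `json:"raw,omitempty"`      // set for raw segments
	Checksum []byte `json:"checksum,omitempty"` // hash of a file payload
}

// Packer is a hypothetical storage interface that records entries in order.
type Packer interface {
	AddEntry(e Entry) (int, error)
}

// jsonPacker writes one JSON document per line to the caller's io.Writer.
type jsonPacker struct {
	enc *json.Encoder
	pos int
}

// NewJSONPacker wraps any io.Writer, e.g. a bytes.Buffer or an *os.File.
func NewJSONPacker(w io.Writer) Packer {
	return &jsonPacker{enc: json.NewEncoder(w)}
}

// AddEntry assigns the next position, encodes the entry, and returns the
// position it was stored at.
func (p *jsonPacker) AddEntry(e Entry) (int, error) {
	e.Position = p.pos
	if err := p.enc.Encode(e); err != nil {
		return -1, err
	}
	p.pos++
	return e.Position, nil
}

// payloadEntry hashes a payload with whichever hash.Hash the caller chose.
func payloadEntry(name string, payload []byte, h hash.Hash) Entry {
	h.Reset()
	h.Write(payload) // hash.Hash.Write never returns an error
	return Entry{Name: name, Checksum: h.Sum(nil)}
}

// packDemo packs one raw segment and one payload reference into an
// in-memory buffer and returns what was written.
func packDemo() (string, error) {
	var buf bytes.Buffer
	p := NewJSONPacker(&buf)
	if _, err := p.AddEntry(Entry{Raw: []byte("header bytes")}); err != nil {
		return "", err
	}
	e := payloadEntry("hello.txt", []byte("hello\n"), sha256.New())
	if _, err := p.AddEntry(e); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := packDemo()
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because the packer only sees an `io.Writer` and a `hash.Hash`, swapping the buffer for a file handle or SHA-256 for another hash needs no change to the packer itself.
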
## State Directory

Perhaps we could deduplicate the header info by hashing the raw bytes and
storing them in a directory tree like:

    ./ac/dc/beef

Then reference the hash of the header info in the positional records for the
tar stream. This could be a future feature, though, and is not required for an
initial implementation. It would also imply an owned state directory, rather
than just writing storage info to an `io.Writer`.

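As a rough sketch of that layout, the snippet below turns a hex digest into a two-level fan-out path. The function names `shardPath` and `headerPath` are hypothetical, and SHA-256 is an assumption, since no hash algorithm or directory depth has been settled on here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
)

// shardPath splits a hex digest into a two-level fan-out, so that
// "acdcbeef" becomes "ac/dc/beef". Short digests are returned unchanged.
func shardPath(hexDigest string) string {
	if len(hexDigest) <= 4 {
		return hexDigest
	}
	return path.Join(hexDigest[:2], hexDigest[2:4], hexDigest[4:])
}

// headerPath hashes raw header bytes and returns the relative path they
// would occupy under a state directory. SHA-256 is an assumption.
func headerPath(raw []byte) string {
	sum := sha256.Sum256(raw)
	return shardPath(hex.EncodeToString(sum[:]))
}

func main() {
	fmt.Println(shardPath("acdcbeef")) // ac/dc/beef

	// Identical header bytes always map to the same path, which is what
	// would make deduplication possible.
	a := headerPath(make([]byte, 512))
	b := headerPath(make([]byte, 512))
	fmt.Println(a == b) // true
}
```
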
## Concept Example

First we'll get an archive to work with. For repeatability, we'll make an
archive from what you've just cloned:

```
git archive --format=tar -o tar-split.tar HEAD .
```

Then build the example `main.go`:

```
go build ./main.go
```

Now run the example over the archive:

```
$ ./main tar-split.tar
2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
pax_global_header pre: 512 read: 52
.travis.yml pre: 972 read: 374
DESIGN.md pre: 650 read: 1131
LICENSE pre: 917 read: 1075
README.md pre: 973 read: 4289
archive/ pre: 831 read: 0
archive/tar/ pre: 512 read: 0
archive/tar/common.go pre: 512 read: 7790
[...]
tar/storage/entry_test.go pre: 667 read: 1137
tar/storage/getter.go pre: 911 read: 2741
tar/storage/getter_test.go pre: 843 read: 1491
tar/storage/packer.go pre: 557 read: 3141
tar/storage/packer_test.go pre: 955 read: 3096
EOF padding: 1512
Remainder: 512
Size: 215040; Sum: 215040
```

*What are we seeing here?*

* `pre` is the header of a file entry, plus any padding from the end of the
  prior file's payload. With particular tar extensions and pax attributes, the
  header can also exceed 512 bytes.
* `read` is the size of the file payload from the entry.
* `EOF padding` is the expected 1024 null bytes at the end of a tar archive,
  plus any padding from the end of the prior file entry's payload.
* `Remainder` is the remaining bytes of an archive. This is typically dead space,
  as most tar implementations return once they have reached the end of the
  1024 null bytes. Various implementations do include some number of bytes
  here, though, which affects the checksum of the resulting tar archive, so
  they must be accounted for as well.

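The `pre` and `EOF padding` numbers in the output above can be checked by hand, since payloads are padded with null bytes to the next 512-byte boundary. The sketch below reproduces a few of them. It assumes each entry has a single 512-byte header block, and the function names are hypothetical helpers, not part of the example program.

```go
package main

import "fmt"

const blockSize = 512 // tar streams are organized in 512-byte blocks

// padding returns the number of null bytes that follow a payload of the
// given size so that the next header starts on a block boundary.
func padding(size int64) int64 {
	return (blockSize - size%blockSize) % blockSize
}

// expectedPre returns the `pre` value for an entry with a single 512-byte
// header block, given the payload size of the entry before it.
func expectedPre(prevSize int64) int64 {
	return padding(prevSize) + blockSize
}

// expectedEOFPadding returns the `EOF padding` value given the payload size
// of the last entry: its padding plus two null blocks.
func expectedEOFPadding(lastSize int64) int64 {
	return padding(lastSize) + 2*blockSize
}

func main() {
	fmt.Println(expectedPre(374))         // 650: DESIGN.md, after .travis.yml
	fmt.Println(expectedPre(1131))        // 917: LICENSE, after DESIGN.md
	fmt.Println(expectedPre(0))           // 512: archive/tar/, after archive/
	fmt.Println(expectedEOFPadding(3096)) // 1512: the EOF padding line
}
```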
Ideally the input tar and the output `*.out` will match:

```
$ sha1sum tar-split.tar*
ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar
ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar.out
```

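The same comparison can be made from Go, for instance in a round-trip test. This is only a sketch: `sha1Hex` and `sameBytes` are hypothetical helpers, and the in-memory byte slices stand in for the contents of the two files.

```go
package main

import (
	"bytes"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
)

// sha1Hex streams r through SHA-1 and returns the digest in hex, the same
// form that sha1sum prints.
func sha1Hex(r io.Reader) (string, error) {
	h := sha1.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sameBytes reports whether two byte slices have the same SHA-1 digest.
func sameBytes(a, b []byte) (bool, error) {
	da, err := sha1Hex(bytes.NewReader(a))
	if err != nil {
		return false, err
	}
	db, err := sha1Hex(bytes.NewReader(b))
	if err != nil {
		return false, err
	}
	return da == db, nil
}

func main() {
	original := []byte("stand-in for tar-split.tar")
	rebuilt := []byte("stand-in for tar-split.tar")

	ok, err := sameBytes(original, rebuilt)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // true
}
```

For real archives, passing the opened files to `sha1Hex` avoids loading either one into memory.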