# tar-split

![Build Status](https://github.com/vbatts/tar-split/actions/workflows/go.yml/badge.svg)
![Lint](https://github.com/vbatts/tar-split/actions/workflows/lint.yml/badge.svg)
[![Go Report Card](https://goreportcard.com/badge/github.com/vbatts/tar-split)](https://goreportcard.com/report/github.com/vbatts/tar-split)

Pristinely disassemble a tar archive, stashing the raw bytes and offsets needed to reassemble an archive that validates against the original.

## Docs

Code API for libraries provided by `tar-split`:

* [github.com/vbatts/tar-split/tar/asm](https://pkg.go.dev/github.com/vbatts/tar-split/tar/asm)
* [github.com/vbatts/tar-split/tar/storage](https://pkg.go.dev/github.com/vbatts/tar-split/tar/storage)
* [github.com/vbatts/tar-split/archive/tar](https://pkg.go.dev/github.com/vbatts/tar-split/archive/tar)

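## Install

The command line utility is installable via:

```bash
go install github.com/vbatts/tar-split/cmd/tar-split@latest
```

This form requires Go 1.16 or later; on older toolchains, use `go get github.com/vbatts/tar-split/cmd/tar-split` instead.
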
## Usage

For CLI usage, see its [README.md](cmd/tar-split/README.md).
For the library, see the [docs](#docs).

## Demo

### Basic disassembly and assembly

This demonstrates the `tar-split` command and how to assemble a tar archive from the `tar-data.json.gz`.

![basic cmd demo thumbnail](https://i.ytimg.com/vi/vh5wyjIOBtc/2.jpg?time=1445027151805)
[YouTube video of basic command demo](https://youtu.be/vh5wyjIOBtc)

### Docker layer preservation

This demonstrates the tar-split integration for docker-1.8, which provides consistent tar archives for the image layer content.

![docker tar-split demo](https://i.ytimg.com/vi_webp/vh5wyjIOBtc/default.webp)
[YouTube video of docker layer checksums](https://youtu.be/tV_Dia8E8xw)

## Caveat

Eventually this should detect tar archives for which precise reassembly is not
possible.

For example, stored sparse files that have "holes" in them will be read as a
contiguous file, though the archive contents may be recorded in sparse format.
Therefore, when adding the file payload to a reassembled tar, the payload would
need to be precisely re-sparsified to achieve identical output. This is not
something I seek to fix immediately; I would rather have an alert that precise
reassembly is not possible.
(See more at http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html)


Another caveat: while tar archives support having multiple file entries for the
same path, we will not support this feature. If there is more than one entry
with the same path, expect an error (like `ErrDuplicatePath`) or a resulting tar
stream that does not validate against your original checksum/signature.

## Contract

Do not break the API of stdlib `archive/tar` in our fork (ideally find an upstream mergeable solution).

## Std Version

The version of the Go stdlib `archive/tar` is from go1.11.
It is minimally extended to expose the raw bytes of the TAR, rather than just the marshalled headers and file stream.

## Design

See the [design](concept/DESIGN.md).

## Stored Metadata

Since the raw bytes of the headers and padding are stored, you may be wondering
what the size implications are. The headers are at least 512 bytes per
file (sometimes more), there are at least 1024 null bytes at the end of the
archive, and then there is various padding. With a naive storage
implementation, the stored metadata therefore grows linearly with the number
of files.

First we'll get an archive to work with. For repeatability, we'll make an
archive from what you've just cloned:

```bash
git archive --format=tar -o tar-split.tar HEAD .
```

```bash
$ go install github.com/vbatts/tar-split/cmd/tar-split@latest
$ tar-split checksize ./tar-split.tar
inspecting "tar-split.tar" (size 210k)
 -- number of files: 50
 -- size of metadata uncompressed: 53k
 -- size of gzip compressed metadata: 3k
```

So, assuming you've managed the extraction of the archive yourself so that the
file payloads can be reused from a relative path, the only additional storage
cost is as little as 3 kB.

But let's look at a larger archive, with many files.

```bash
$ ls -sh ./d.tar
1.4G ./d.tar
$ tar-split checksize ~/d.tar
inspecting "/home/vbatts/d.tar" (size 1420749k)
 -- number of files: 38718
 -- size of metadata uncompressed: 43261k
 -- size of gzip compressed metadata: 2251k
```

Here, an archive with 38,718 files has a compressed footprint of about 2 MB.

Rolling the null bytes at the end of the archive into the total, we will assume
a bytes-per-file rate for the storage implications.

| uncompressed | compressed |
| :----------: | :--------: |
| ~1 kB per file | ~0.06 kB per file |


## What's Next?

* More implementations of storage Packer and Unpacker
* More implementations of FileGetter and FilePutter
* It would be interesting to have an assembler stream that implements `io.Seeker`

## License

See [LICENSE](LICENSE)