# Modern Office Git Diff
Tomáš Hübelbauer committed

> An experiment in tracking and diffing versions of modern Microsoft Office files in Git.

Modern Office file formats are ZIP archives with XML files in them.
The ZIP archives are binary files, so Git (and furthermore GitHub and GitLab, where the diff cannot be tweaked) won't display a nice diff for them.
The XML files are not binary, so in order to display a diff for these, this project unpacks the ZIP files into directories that are tracked in Git.
Tracking generated files is pretty dumb, but so is tracking binary files, and when forced to have one,
it's not a leap to have the other as well if it brings something useful to the table.

This is achieved using a PowerShell script which unpacks the ZIP file to a tracked directory,
formats the XML files for a readable diff and tracks the formatted files as well.
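
The unpack-and-format step can be illustrated with a minimal Python sketch. This is not the project's PowerShell script: the function name `unpack_and_format` is hypothetical, and overwriting each extracted XML part in place with its pretty-printed form is a simplifying assumption.

```python
import zipfile
from pathlib import Path
from xml.dom import minidom


def unpack_and_format(office_file: Path, target_dir: Path) -> list[Path]:
    """Extract an Office file (a ZIP archive) and pretty-print its XML parts.

    Hypothetical helper for illustration only; returns the formatted paths.
    """
    formatted = []
    with zipfile.ZipFile(office_file) as archive:
        archive.extractall(target_dir)
        names = archive.namelist()
    for name in names:
        path = target_dir / name
        # Office packages keep their XML in .xml and .rels parts;
        # anything else (images, embedded binaries) is left untouched.
        if path.suffix in {".xml", ".rels"} and path.is_file():
            document = minidom.parseString(path.read_bytes())
            # Assumption: the formatted output replaces the extracted file.
            path.write_bytes(document.toprettyxml(indent="  ", encoding="utf-8"))
            formatted.append(path)
    return formatted
```

Pretty-printing puts each element on its own line, so a one-word edit in the document tends to show up as a small line-level diff rather than as a change to one very long line.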

**Examples:**

The XML diff captures the exact change, whereas the TXT diff captures text-only changes for quick content inspection.

- [Example Word diff](https://github.com/TomasHubelbauer/modern-office-xml-git/commit/3413eacaaeb236a06033a443d7979f19207a613b)
- [Example Excel diff](https://github.com/TomasHubelbauer/modern-office-xml-git/commit/5f4ef47d345ab451f17e41ebf0befbc842ff5dba)
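
To make the text-only idea concrete, here is a small Python sketch of collecting just the text nodes of an XML part. The function name `extract_text` and the one-text-node-per-line output are assumptions for illustration; the project's actual TXT layout may differ.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_source: bytes) -> str:
    """Return the non-empty text nodes of an XML document, one per line."""
    root = ET.fromstring(xml_source)
    # itertext() walks all text in document order, ignoring tags and attributes.
    lines = (text.strip() for text in root.itertext())
    return "\n".join(line for line in lines if line)


print(extract_text(b"<doc><p><r>Hello</r><r> world</r></p></doc>"))
# prints:
# Hello
# world
```

The result is lossy by design: markup, attributes and formatting are dropped, so the diff shows only what changed in the content.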

**Features:**

- Every Office file (DOCX, XLSX, PPTX) has a complementary `.git` directory with XML and TXT files for diffing
- Formatting XML files for nicer diffing
- Generating TXT files from just text nodes for lossy text-only diffing
- Warning in extracted and generated content that the data is read-only
- Skipping processing unchanged files for fast operation even in repos with many Office files
- Removing associated generated content automatically for Office files that have been removed from the repo
- Ability to run as a Git hook for worry-free tracking
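
How unchanged files are detected is not described here. One possible approach, sketched in Python under that caveat, is to compare a content hash with the hash recorded on the previous run. The `needs_processing` function and the stamp file are hypothetical names, not part of the project.

```python
import hashlib
from pathlib import Path


def needs_processing(office_file: Path, stamp_file: Path) -> bool:
    """Return True if the Office file changed since its hash was last recorded.

    Hypothetical helper: the stamp file holds the SHA-256 of the last seen version.
    """
    digest = hashlib.sha256(office_file.read_bytes()).hexdigest()
    if stamp_file.exists() and stamp_file.read_text().strip() == digest:
        return False  # same content as last time, nothing to regenerate
    # Record the new hash so the next run can skip this file.
    stamp_file.write_text(digest)
    return True
```

A caller would run the expensive unpack-and-format step only when this returns `True`, which keeps a run cheap when most Office files in the repo are untouched.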

**Limitations:**

- Stores compressed *and* uncompressed versions in Git - by design, for plain text diffing and binary source of truth
- No support for DOC, XLS and PPT, only XLSX, DOCX and PPTX (XML based formats) - by design, no use diffing binary formats
- Risk of getting generated files out of sync if the hook is not run or a manual edit is made to the generated files
- Won't process files uploaded to the repository through the GitHub/GitLab online UI (no pre-commit hook)

**Support:**

- Windows: 10.0.16299+ (`Get-WmiObject -Class Win32_OperatingSystem | Select-Object Version`)
- Unix: 4.4.0+ (`uname -r`)

## Running

Run PowerShell scripts using the VS Code PowerShell Integrated Console to avoid security blocks.
Open it by clicking on any `.ps1` file while the integrated terminal is open, or by running the *PowerShell: Show Integrated Console* VS Code command (F1).

- Run `cmd/version-office-files.ps1` from the command line
- Run `cmd/edit-in-powrshell-ise.ps1` to open in PowerShell ISE (Integrated Scripting Environment)
- Add a Git pre-commit hook:

```sh
cp .git/hooks/pre-commit.sample .git/hooks/pre-commit
code .git/hooks/pre-commit
```

`.git/hooks/pre-commit`

```sh
#!/usr/bin/env bash
powershell cmd/version-office-files.ps1
```

Observe commit diffs to see Office file changes in the XML and TXT files.

## Testing

Run PowerShell scripts using the VS Code PowerShell Integrated Console to avoid security blocks.
Open it by clicking on any `.ps1` file while the integrated terminal is open, or by running the *PowerShell: Show Integrated Console* VS Code command (F1).

Run `cmd/run-tests.ps1`.

## Contributing

See [planned development](doc/tasks.md).

## Studying

See `git log` and [development notes](doc/notes.md).

Some notable prior art:

- [Martin Fenner (2014)](http://blog.martinfenner.org/2014/08/25/using-microsoft-word-with-git/)
- [Ben Balter (2015)](https://ben.balter.com/2015/02/06/word-diff/)
- [Jon Hill (2017)](https://www.ficonsulting.com/filabs/MSOfficeGit)

All of these focus on generating text-only versions of the files on demand (untracked) and do not capture structural changes.
This project aims to explore the other, potentially less useful but nonetheless interesting, route of versioning both
the compressed and the uncompressed forms of a file in parallel. See the features and limitations above for pros and cons.