[Toybox] More than you really wanted to know about patch.

On 1/13/19 3:57 PM, scsijon wrote:
> Any chance of a two or three page "Introduction to Creating and Understanding
> Patches for Dummies" for those of us who either don't know how to build one, or
> like me, have, "but don't really know what i'm doing".
>
> When you can make time of course, i'd really like to understand more of what the
> group is doing with patches submitted rather than only a little.
>
> Please, with pure honey on crumpets.

Patches are reasonably straightforward, if somewhat reverse engineered historically.

Back in the 1980's somebody invented diff -u ("unified diff format") as a more human readable alternative to the <old >new lines format you get without the -u, and then Larry Wall whipped up a program to reverse the process and use saved diff -u output to modify a file (which was mind-blowing at the time). As far as I can tell the format wasn't really meant for that, and was made to work with heuristics and hitting it with a rock, but Larry _did_ go on to invent Perl...

A patch is a series of "hunks", describing a range of lines in the "old" version and the corresponding range in the "new" version. Patches have 6 different types of lines, each starting with one of "+++ ", "--- ", "@@ ", " ", "+", or "-".

The first 2 (the --- and +++ lines) are control lines that come at the start and indicate we're working on a new file. They indicate the old file name and the new file name for the changed files. If you "diff -u oldfile newfile" you get a hunk starting with:

  --- oldfile
  +++ newfile
  @@ -oldstart,oldlines +newstart,newlines @@ comment
  and so on

Those first two lines are --- or +++, one space, and the filename. Unfortunately, the original unified diff format then followed each filename with a tab character and the timestamp of the file (in "yyyy-mm-dd hh:mm:ss tzoff" format), which means if you have a tab character in the filename you can't patch it. These days this datestamp is optional, and most patches don't have them anymore.
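As a rough sketch of pulling the filename out of a --- or +++ line (Python for brevity, and not how toybox or any other patch implementation actually does it): the helper name parse_name_line is made up, and the rule "a tab followed by a digit starts the timestamp" is an assumption chosen for illustration.

```python
import re

# Hypothetical rule: a tab followed by a digit begins the optional timestamp.
# A tab followed by anything else is treated as part of the filename.
_STAMP = re.compile(r"\t\d.*$")

def parse_name_line(line):
    """Return ("old"|"new", filename) for a '--- ' / '+++ ' line, else None."""
    for kind, prefix in (("old", "--- "), ("new", "+++ ")):
        if line.startswith(prefix):
            name = line[len(prefix):].rstrip("\n")
            # Drop the "\t<timestamp>" suffix if one is present.
            name = _STAMP.sub("", name, count=1)
            return kind, name
    return None

if __name__ == "__main__":
    print(parse_name_line("--- oldfile\t2019-01-13 15:57:00.000000000 -0600\n"))
    print(parse_name_line("+++ b/path/to/file\n"))
```

A filename containing a newline still can't be represented, and a filename containing a tab followed by a digit would still be cut short.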
(I have a todo item to make toybox patch work backwards from the end of the line and peel off only a properly formatted tab+date entry and leave it alone otherwise, but right now it just stops at the first tab. Which is not a space or newline, and thus almost never occurs in filenames and nobody's complained yet (because if you tab in the windows gui it switches focus so windows people can't trivially create this breakage and then whine for us to "support" it)... Still, lemme do a quick commit to make that suck _slightly_ less by at least requiring the next character to be a digit in order to match the date and strip it off. It still doesn't handle filenames with a newline in them, but... how would you?)

If this (now optional) date was the unix epoch (midnight, January 1, 1970, which timezone adjustments often moved to December 31, 1969), it indicated we were comparing against a nonexistent file. The more modern way to say this is to use the special filename /dev/null. So if you want patch to create a file, what you do is "diff -u /dev/null newfile", and if you want it to delete a file, "diff -u oldfile /dev/null". (Otherwise it leaves a zero length file when you remove all the lines, or expects an empty file to already be there when adding with no context lines.)

The other fun thing is when you diff 2 files, the files need to have different names. How do you know which one you're applying the patch to? Historically, it tried both names and used whichever one worked... but if you happen to have a file with your tempname lying around in the directory you're applying the patch _to_ (which happens a lot when you habitually use the same tempfile name), the hunk may try to apply to the wrong file. (There were certain horrible heuristics I don't remember that tried to work out what you _meant_ to do, which didn't really help and I don't think I implemented them?)

And these days files have paths.
As the switch from CVS to SVN (let alone git) taught us: individual standalone files aren't very interesting, you're almost always operating on a _tree_ of files. So generally what you do _now_ (and what tools like svn or mercurial or git pretend to do behind the scenes) is back up one directory, have two full trees (the vanilla project and your modified version), and "diff -ruN" the two subdirectories: -r is recursive, -u is unified format instead of the old < and > version, and -N says pretend to compare new or removed files against /dev/null so the diff says to add or remove them properly.

That's why tools like svn or mercurial or git will create diffs that start like:

  --- a/path/to/file
  +++ b/path/to/file

Except... now you've got an extra level of directory you don't want, so you have to back up _out_ of your project's tree to apply the patch and it's STILL guessing which name you mean.

So what you do is create the diffs like that, then use the "-p 1" option when applying them, which says "peel off one layer of directory when parsing the filenames". That removes the a/ and b/ from the paths, and the rest should be identical so it's no longer ambiguous and it doesn't matter if you use the +++ or the --- line as the file to apply the patch to.

(No, -p1 doesn't apply to the magic name /dev/null, absolute paths aren't modified, only relative ones. Also, you can say "-p0" to disable the above "certain horrible heuristics" on pathless filenames and just literally use the filenames in the patch, but that doesn't come up much these days. Creating a diff between two trees and applying it within the top level of the tree via "patch -p1" is nearly universal now. That's the format "git format-patch -1 $HASH" and "git am file.patch" are using, for example.)

Ok, so all that's indicating what file hunks apply to, then you get to actual hunks describing what changes to make within the file.
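The -p behavior described above can be sketched in a few lines of Python. This is an illustration only: strip_components is a made-up name, and what to do when a path has fewer components than requested is my assumption, not something the text above specifies.

```python
def strip_components(name, count):
    """Drop `count` leading directory components, as "patch -p count" would."""
    # Per the description above: /dev/null and absolute paths are left alone.
    if name == "/dev/null" or name.startswith("/"):
        return name
    parts = name.split("/")
    # Assumption: if there aren't enough components, keep just the last one.
    if count >= len(parts):
        return parts[-1]
    return "/".join(parts[count:])

if __name__ == "__main__":
    # With -p1 both header names collapse to the same path, so it no longer
    # matters whether you take the name from the --- or the +++ line.
    print(strip_components("a/path/to/file", 1))
    print(strip_components("b/path/to/file", 1))
    print(strip_components("/dev/null", 1))
```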
Each hunk starts with an @@ line, with 4 numbers, like so:

  @@ -start,len +start,len @@ comment

Each "start" is the (decimal) line number in that file the hunk starts applying at, and the "len" is the (decimal) number of lines described in that file. These numbers measure the body of the hunk, which comes next.

(The "comment" part can be anything, and doesn't even have to be there. It's ignored. Modern language-aware diff -u variants stick which C function you're modifying in there, which is nice for humans but not used by patch that I know of. The simple crappy heuristic there is "last unindented line", which can find goto labels: ...)

Each line of the rest of the body of that hunk starts with one of three characters:

1) + meaning this line is only in the new version (it was added).
2) - meaning this line is only in the old version (it was removed).
3) " " (space) meaning this line is the same in both (it's context for the changes).

The context lines plus + lines need to add up to the "len" in the + part of the @@ line, and the context lines plus - lines need to add up to the len in the - part. (The start is more or less a comment, used to indicate how far off it applies at if the hunk moved but otherwise not really mattering as far as I can tell. Well, toybox doesn't use it.)

Note: if your code is tab indented, it still needs a space (ascii 32) at the start of it to be a context line, then it's binary identical for the contents (so tabs or spaces as appropriate). This causes some editors to flip out about mixing tabs and spaces, but the distinction is functional here.
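To make the counting rule concrete, here's a small Python sketch that parses an @@ line and checks a hunk body against it. The names parse_hunk_header and body_counts are hypothetical. The one thing it does beyond the description above is accept a missing ",len" as meaning 1, which is what GNU diff emits for one-line ranges.

```python
import re

# "@@ -start,len +start,len @@ optional comment"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk_header(line):
    """Return (old_start, old_len, new_start, new_len) from an @@ line."""
    m = HUNK_RE.match(line)
    if not m:
        raise ValueError("not a hunk header: %r" % line)
    old_start, old_len, new_start, new_len = m.groups()
    # A missing ",len" means a length of 1 (not covered in the text above).
    return (int(old_start), 1 if old_len is None else int(old_len),
            int(new_start), 1 if new_len is None else int(new_len))

def body_counts(body):
    """Count the lines a hunk body describes in the old and new file."""
    old = sum(1 for l in body if l[:1] in (" ", "-"))  # context + removed
    new = sum(1 for l in body if l[:1] in (" ", "+"))  # context + added
    return old, new

if __name__ == "__main__":
    header = "@@ -10,4 +10,5 @@ int main(void)"
    # Note the tab-indented context line still starts with a space.
    body = [" ctx1", "-old", "+new1", "+new2", " \tindented ctx", " ctx3"]
    _, old_len, _, new_len = parse_hunk_header(header)
    print(body_counts(body) == (old_len, new_len))
```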
Patch opens files when it sees --- +++ line pairs, reads in the next @@ hunk and the appropriate number of lines after it (with the right number of context lines, additions, and removals for what the @@ line counts said), and then searches in the file for a place where the appropriate context lines and removed lines appear in the right order (removed lines are matched just like context, if they're not there in the file the hunk doesn't apply), then replaces it with the set of context lines and added lines the hunk says should go there instead.

(Note that if you patch -R then it's the + lines being removed and the - lines being added, "reversing" the patch.)

Each hunk generally starts with 3 leading context lines, and ends with 3 trailing context lines, which generally provides enough context to uniquely identify where to apply the hunk even if you're just adding a single line (that's the pathological case of providing no other corroborating information). The exception is when your hunk applies at the start or end of the file: then there aren't enough context lines, and may not be _any_ if you're right at the end or beginning of the file.

The hunk also has interstitial context lines as appropriate (between the additions and removals, which also have to match or the hunk won't apply), but not more than 6 (leading + trailing context line count) or it'd split into 2 hunks. (This _does_ mean you can have 4 context lines in a row though.)

What IS important is that you have the same number of leading context lines as trailing context lines, unless you're at the start/end of a file. If you don't, it's not a valid hunk and patch barfs on the corrupted patch. And the number of leading/trailing context lines not being the same means the patch program will try to MATCH the start/end of the file (whichever one's got truncated context), and fail if it can't (hunk does not apply, context is wrong).
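The search-and-replace step described above can be sketched like so (Python, heavily simplified). apply_hunk is a hypothetical name, there's no fuzz or offset reporting, and it deliberately leaves out the start/end of file anchoring for truncated context, so treat it as an illustration of the matching rule rather than a real patch.

```python
def apply_hunk(lines, body, start=0, reverse=False):
    """Apply one hunk body to `lines` (a list of strings without newlines).

    Returns (new_lines, index just past the replaced region).
    Raises ValueError if the hunk does not apply at or after `start`.
    """
    if reverse:
        # Like "patch -R": + lines get removed, - lines get added.
        swap = {"+": "-", "-": "+"}
        body = [swap.get(l[:1], l[:1]) + l[1:] for l in body]
    # Context and removed lines must all be present, in order.
    want = [l[1:] for l in body if l[:1] in (" ", "-")]
    # Context and added lines are what ends up there instead.
    repl = [l[1:] for l in body if l[:1] in (" ", "+")]
    # Try each candidate position in turn, moving down one line at a time.
    for pos in range(start, len(lines) - len(want) + 1):
        if lines[pos:pos + len(want)] == want:
            return lines[:pos] + repl + lines[pos + len(want):], pos + len(repl)
    raise ValueError("hunk does not apply")

if __name__ == "__main__":
    old = ["a", "b", "c", "d", "e", "f", "g", "h"]
    body = [" b", " c", " d", "-e", "+E", " f", " g", " h"]
    new, nxt = apply_hunk(old, body)
    print(new, nxt)
    print(apply_hunk(new, body, reverse=True)[0] == old)
```

The `start` argument is there so a caller can resume searching from where the previous hunk left off instead of from the top of the file.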
You can have as many hunks as you want within a file, i.e. as many @@ lines after a given --- +++ pair, but the hunks must apply in order, and this INCLUDES the context lines. A line that's been "seen" as a trailing context line won't match against the leading context of the next hunk.

Because of this, you sometimes need 3 or more interstitial context lines in a row in the _middle_ of a hunk (between + and - lines), if that's how your changes work out. A number of consecutive context lines matching the leading context does NOT end the hunk, only consuming the line counts from the @@ line does that. And then you figure out if leading/trailing context counts match (indicating the need to match start/end of file) _after_ that.

(If you really want to back up and modify an earlier part of the file, you need a new --- +++ pair to flush and reopen the file, so it can start over searching at the beginning.)

Oh, I know I said the start numbers in the @@ line were only used for warnings, but you CAN use them to sanity check the leading context number if you want to. (Since if you're forcing a match with the beginning of the hunk, it had better start at 0 in that file or something is wrong.) Doesn't help with end of file though.

So you wind up with:

  --- filename
  +++ filename
  @@ -start,len +start,len @@
   context
   context
   context
  -blah
  +blah
   context
   context
   context
  @@ -start,len +start,len @@
  ...

Oh, the - lines usually come before the + lines when they're modifying the same line, but I don't think that's actually required? The entire context is matched before applying the hunk anyway.

And note that you don't skip what you've already looked at when a hunk didn't apply, you go down ONE line and try matching again. If your context lines are all blank, you can skip the start of where this hunk applies otherwise. I hit and fixed that bug years ago in toybox.
:)

And of course all this is before git added a "rename" syntax that looks like:

  https://lwn.net/Articles/244448/

And has copy and delete variants that allow it to be much less verbose (avoids including the body of the matched file(s)). It's on the todo list... :)

Rob

P.S. You asked.

> ps and i'm looking forward to the next mkroot, I miss Aborigonal!

Alas, I just landed back in Milwaukee to do another round of $DAYJOB because neither toybox nor mkroot pay the bills.

(I'm very grateful to the https://patreon.com/landley subscribers, and it's great encouragement, but my mortgage alone is like 25 times what that brings in. Nobody with a significant budget wants to fund this work, and keeping the lights on gets scheduled higher than things that don't. But I can presumably cut a mkroot release with the 4.20 kernel right after I do a toybox release at the end of the month. All 4.20 broke that I've noticed so far was adding sha256 as a hard requirement to the s390x build, and I can add that to the toybox airlock install passthroughs for the moment...)

(I had a huge todo list for my month off... and wound up going limp for most of it. I was doing ok until the battery in this old laptop completely died (as in unplug = instant off, so suspend is useless and I lose all open windows every time I move it). And alas I did NOT get Devuan working on the new System76 laptop I ordered a few months back (binary wifi firmware tantrum in the installer), and what they preinstalled on it has systemd, and given a choice between "system with no battery" and "system with systemd" it's no contest. But I did get the new She-Ra and Hilda watched, and the first season of The Good Place, so that's something...)

Still Rob