GoNotes.adoc 32.9 KB
Newer Older
1
= Notes on the Go translation of Reposurgeon =
Anthony Fok's avatar
Anthony Fok committed
2
:source-highlighter: prettify
Eric S. Raymond's avatar
Eric S. Raymond committed
3
version 1.9, 2020-01-26
4

Julien "_FrnchFrgg_" Rivaud's avatar
Julien "_FrnchFrgg_" Rivaud committed
5
This is an experience report on a Python-to-Go translation of a
Eric S. Raymond's avatar
Eric S. Raymond committed
6
7
8
9
10
11
12
13
14
program with significant complexity, written in attempted conformance
with the Go community's practice for grounding language enhancement
requests not in it-would-be-nice-to-have abstractions but rather in a
description of real-world problems.

Reposurgeon is a program for editing version-control histories and
interconverting them among version-control systems. I spent much of
2019 moving the reposurgeon codebase from Python to Go because the
Python implementation was too slow to be useful on on really large
15
repositories.  The workflow it is designed to support is rapid
Eric S. Raymond's avatar
Eric S. Raymond committed
16
iterative improvement of conversion recipes by a human operator;
17
18
long test cycles disrupt this by making experimentation painful.

Julien "_FrnchFrgg_" Rivaud's avatar
Julien "_FrnchFrgg_" Rivaud committed
19
Subversion-to-git conversion of the Gnu Compiler Collection history,
20
21
22
23
24
25
26
at over 280K commits (over 1.6M individual change actions), was the
straw that broke the camel's back. Using PyPy with every optimization
on semi-custom hardware tuned for this job still yielded test
conversion times of over 9 hours, which is death on the
recipe-debugging cycle.  A test translation of the auxiliary
repocutter tool suggested that we could expect up to a 40x speedup,
which was in line with published Python vs. Go comparative benchmarks.
27
28

The problem directed the choice of Go, not the other way around.  I
29
30
31
32
seriously considered OCaml or a compiled Lisp as alternatives.  I
concluded that in either case the semantic gap between Python and
the target language was so large that translation would be
impractical. Only Go offered me any practical hope.
33

Eric S. Raymond's avatar
Eric S. Raymond committed
34
35
36
I did examine two automated tools for Python to Go translation, but
rejected them because I judged the generated Go code would have been a
maintainability disaster.  Thus, translation by hand.  Though at about
37
38
22% in I did write https://gitlab.com/esr/pytogo[a fast, crude,
incomplete Python-to-Go source translator] to assist the process.
Eric S. Raymond's avatar
Eric S. Raymond committed
39
40

The man barrier to translation was that, while at 14KLOC of Python
41
reposurgeon was not especially large, the code is very *dense*.  It's
Anthony Fok's avatar
Anthony Fok committed
42
a DSL that's a structure editor for attributed DAGs -- algorithmically
Eric S. Raymond's avatar
Eric S. Raymond committed
43
complex, bristling with graph theory, FSMs, parsing, tricky data
44
structures, two robot harnesses driving other tools, and three
Eric S. Raymond's avatar
Eric S. Raymond committed
45
46
different operator-composition algebras.  It became 21KLOC of Go.

47
The algorithmic density of reposurgeon is such that it would be a
Eric S. Raymond's avatar
Eric S. Raymond committed
48
49
50
51
challenge to the expressiveness of any language it were implemented
in.  It makes a good test of the relative expressiveness of Python and
Go, and an effective way to audit for places where moving from Python
to Go hinders concision and readability.
52
53

The skillset I approached this problem with is: Go novice, Python and
54
C expert; old Unix hand; Lisp hacker of even more ancient vintage;
55
56
57
58
59
ex-mathematician with strength in graph theory, group theory and
combinatorics; lots of experience as a systems programmer, lots of
exposure to odd languages, and lots of domain knowledge about
version-control systems.  Adjust bias compensators accordingly.

Eric S. Raymond's avatar
Eric S. Raymond committed
60
61
(The 1.8 version of these notes was shipped to golang-nuts on
2020-01-25; the topic thread is
62
63
https://groups.google.com/forum/#!topic/golang-nuts/u-L7PRa2Z-w[here].
A typo fix and a clarification have been backported from the thread.)
Eric S. Raymond's avatar
Eric S. Raymond committed
64

65
66
== Expected problems that weren't ==

Anthony Fok's avatar
Anthony Fok committed
67
*Semantic distance.*  In general, the translation gap between Python and
68
69
70
Go is narrower than I initially expected, especially considering the
dynamic-vs-static type-system difference.  On reflection, I think it
turns out that GC and the existence of maps as first-class types do
71
72
more to narrow that gap than the static/dynamic divergence in type
systems does to widen it.
73

Anthony Fok's avatar
Anthony Fok committed
74
75
*Polymorphic lists.*  The principal data structure of a repository's
representation in reposurgeon is a list of events -- data structures
76
77
78
79
representing modification operations on sets of files.  The list is
necessarily polymorphic because (for example) a change commit and a
tag creation are different kinds of things.  Translating this to
static typing using interfaces proved less arduous than I had feared,
Eric S. Raymond's avatar
Eric S. Raymond committed
80
81
though the process revealed a documentation issue and a problem
with absence of sum types; I will return to both points.
82

Anthony Fok's avatar
Anthony Fok committed
83
*Operator overloading.*  Good style in Python, but surprisingly easy to
84
relinquish in Go.  I went in thinking that they'd be on my want list
Anthony Fok's avatar
Anthony Fok committed
85
for Go before I was done, but no -- not even at reposurgeon's
86
87
complexity scale.

Anthony Fok's avatar
Anthony Fok committed
88
*Generics.*  Yes, these would have made translation easier, but the main
89
impact turned out to be that I had to write my own set-of-int and
90
91
92
93
set-of-string classes.  That was a surprisingly light hit.  What I
really missed was generic map-function-over-slice, which could be
handled by adding a much narrower feature.

94
The positive part of my summation is that hand-translation of Python
95
96
to Go even at this scale and complexity is not horrible.  It's not
*easy*, exactly, but quite doable.  It is however time-intensive;
97
counting time to build the required unit tests in Go, I managed about
98
99
150-200 lines a day before writing pytogo and 500-600 lines per day
afterwards.  The entire translation, interleaved with my work on NTPsec
100
101
102
and other projects, took just about 12 months in wall time. Perhaps
a third of that was spent debugging the Go result after it achieved
first full compilation.
103

104
The pain points were mainly around a handful of advanced Python
105
106
features: iterators, generators, comprehensions, and class-valued
exceptions.  Fortunately, even in quite advanced Python code like
107
reposurgeon's these turn out to not be very common on a per-KLOC basis.
108

109
== Problems that were ==
110
111
112

=== Keyword arguments ===

113
The problem that obtruded on me first was quite unexpected: absence of
114
115
116
keyword arguments.  In Python one can write a function signature like
this

Anthony Fok's avatar
Anthony Fok committed
117
[source,python]
118
119
120
121
122
123
----------------------------------------------------------------------
    def EuclideanDistance(x, y):
----------------------------------------------------------------------

and then call it like this:

Anthony Fok's avatar
Anthony Fok committed
124
[source,python]
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
----------------------------------------------------------------------
    d = EuclideanDistance(x=3.0, y=9.6)
----------------------------------------------------------------------

I used keyword arguments extensively in the Python, especially in
object-creation functions where it is often required to pass in
multiple parameters that won't fit neatly into a single data
structure.

Go presently has no such feature. This is probably the single most
serious readability hit my translation took; it got *significantly* more
difficult to grasp what was going on at all those callsites.

=== No map over slices ===

Translating Python map() calls and comprehensions produces code that
141
is ugly and bulky, forcing the declaration of dummy variables that
142
shouldn't need to exist.
143

144
If one graded possible Go point extensions by a figure of merit in which the
145
numerator is "how much Python expressiveness this keeps" and the
Anthony Fok's avatar
Anthony Fok committed
146
denominator is "how simple and self-contained the Go feature would be",
147
148
149
150
151
152
153
154
155
156
I think this one would be top of list.

So: map as a functional builtin takes two arguments, one x = []T and a
second f = func(T)T. The expression map(x, f) yields a new slice in
which for each element of x, f(x) is appended.

This proposal can be discarded if generics are implemented, as any
reasonable implementation of generics would make it trivial to
implement in Go itself.

157
=== Annoying limitations on const ===
158
159
160
161

Inability to apply const to variables with structure, map, or slice
initializers is annoying in these ways:

162
1. Compiler can't enforce _noli me tangere_
163
164
165
166
167

2. const functions as a declaration of programmer intent that is
   valuable at scale.

In Python one can often get a similar effect by using tuples.  I used
168
this as a form of internal documentation hint in the original Python.
169
I want it back in Go.
170

Julien "_FrnchFrgg_" Rivaud's avatar
Julien "_FrnchFrgg_" Rivaud committed
171
Any extension in the scope of const, even a relatively conservative
172
173
174
175
176
177
178
179
one like only allowing const structures with compile-time constant
members, would have significant benefits.

=== Absence of lookbehind in Go regexps ===

This is a small point problem, easily fixed, that was far more
annoying in practice than it should have been in theory.

180
181
182
Python regexps have both positive and negative lookbehind clauses.
The following expression looks for possible Subversion revision
designators in comments, excluding bug references:
183

Anthony Fok's avatar
Anthony Fok committed
184
 "(?<!bug )[0-9]+"
185
186
187
188
189

Go translation reveals that it is remarkably unpleasant, verging on
"too painful to be worth it" to do that filtering without lookbehinds.

This is the only real problem I have identified in moving from Python
190
191
192
regexps to Go ones.  Take that "only" seriously, because regexps are a
Swiss-army knife I use heavily; Go regexps are doing well to have no
limits that are more annoying.
193

194
=== Absence of sum/discriminated-union types ===
Eric S. Raymond's avatar
Eric S. Raymond committed
195

Anthony Fok's avatar
Anthony Fok committed
196
197
I have read https://github.com/golang/go/issues/19412[issue #19412]
and am aware of the objections to adding sum types to Go.
Eric S. Raymond's avatar
Eric S. Raymond committed
198
199
200
201
202
203

Nevertheless, I found their absence was something of a pain point in my
translation.  Because reposurgeon events can have any one of a set of
types (Blob, Tag, Commit, Callout, Passthrough, Reset) I found myself
writing a lot of stupid boilerplate code like this:

Anthony Fok's avatar
Anthony Fok committed
204
[source,go]
Eric S. Raymond's avatar
Eric S. Raymond committed
205
--------------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
206
207
208
209
210
211
212
213
214
215
216
	for _, child := range commit.children() {
		switch child.(type) {
		case *Commit:
			successorBranches.Add(child.(Commit).branch)
		case *Callout:
			complain("internal error: callouts do not have branches: %s",
				child.idMe())
		default:
			panic("in tags method, unexpected type in child list")
		}
	}
Eric S. Raymond's avatar
Eric S. Raymond committed
217
218
219
--------------------------------------------------------------------

Besides being inelegant, the requirement for a runtime check to
220
exhaust all cases is a defect attractor.  It's way too easy to forget
Eric S. Raymond's avatar
Eric S. Raymond committed
221
222
223
224
to write the default case and wind up with silent errors.

Thus, absence of discriminated-sum types is an actual hole in the
language that compromises its goal of enforcing strong invariants
225
through type safety checked at compile time.
Eric S. Raymond's avatar
Eric S. Raymond committed
226
227

This will especially tend to become an issue when translating from
228
a language like Python with fully dynamic typing.
Eric S. Raymond's avatar
Eric S. Raymond committed
229
230
231
232

I don't have a concrete proposal to fix this yet. If these notes
are well received I may write one.

233
===  Catchable exceptions require silly contortions ===
234

Julien "_FrnchFrgg_" Rivaud's avatar
Julien "_FrnchFrgg_" Rivaud committed
235
Though I revised it significantly on completion, much of this report
236
237
238
239
was originally written at about the 12% point of the translation. By
twice that far in, 23%, another problem about which I had not
originally been intending to complain became obtrusive. That is
absence of a general facility for structured exceptions.
240
241
242

Yes, I'm familiar with all the reasons throw/catch wasn't included in
Go 1.  Including the laudable goal of forcing programmers to be
243
explicit about error handling and how they propagate errors up their
244
245
call stack.  And I understand that defer/recover was an attempt to
provide a tractable subset of catchable exceptions that would minimize
246
247
248
the temptation to sin.

Because I broadly agree with this set of goals, I was actively
249
intending when I started this translation not to complain about the lack
250
251
252
of general catchable exceptions, or ship any related RFEs, in spite of
having a presentiment that they would be a problem.  That is, until
I hit a wall in the real world and had to rethink.
253
254

Here's my use case. Reposurgeon is an interpreter for a DSL.
255
256
257
Situations in which I can tolerate panic-out and die are rare and
mostly occur at initialization time. Usually what I want to do instead
of panicking on error is throw control back to the read/eval loop,
258
259
260
executing some kind of local cleanup hook on the way out.  Analogous
situations will frequently occur in, for example, network servers.

261
262
263
In a language with labeled throw/catch, or class-valued exceptions, I
can address this by explicitly target an exception to some level of
the call stack above the point it's raised.  In reposurgeon, for
264
example, there are usually two levels of interest.  One is the DSL's
265
266
267
268
269
270
271
272
read-eval loop. The other is the outermost scope; if an exception gets
there I want to call hooks to gracefully remove working directories
(blob storage associated with the repository-history structures being
edited) before exiting the program.

In Go, I didn't seem to have a clean option for this.  Which was a
problem on two levels....

273
274
275
276
277
278
279
280
1. Python reposurgeon was 14 KLOC of *dense* code.  At that scale, any
prudent person in a situation like this will perform as linear and
literal a translation as possible; to do otherwise is to risk a
complexity explosion as you try to cross the semantic gap and rethink
the design at the same time.  Absence of class-valued exceptions was
far and away the biggest technical blocker.  "First make it work, then
make it right"; the least risky path seemed to be to shim in
exceptions with the intention of removing them later.
Anthony Fok's avatar
Anthony Fok committed
281
+
282
283
Eventually, after beating on the panic/recover feature for a while, I
found this kludge:
Anthony Fok's avatar
Anthony Fok committed
284
285
+
[source,go]
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---------------------------------------------------------------------
package main

import "fmt"

type exception struct {
	class string
	message string
}

func (e exception) Error() string {
	return e.message
}

func throw(class string, msg string, args ...interface{}) *exception {
	// We could call panic() in here but we leave it at the callsite
	// to clue the compiler in that no return after is required.
	e := new(exception)
	e.class = class
	e.message = fmt.Sprintf(msg, args...)
	return e
}

func catch(accept string, x interface{}) *exception {
	// Because recover() returns interface{}.
	// Return us to the world of type safety.
	if x == nil {
		return nil
	}
	err := x.(*exception)
	if err.class == accept {
		return err
	}
	panic(x)
}

func main() {
	defer println("Defer 1")
	defer println("Defer 2")
	defer println("Defer 3")

	defer func() {
		fmt.Println("Recover:", catch("recoverable", recover()))
	}()
	panic(throw("recoverable", "Don't Panic!!!"))

	fmt.Println("Unreachable.")
}
---------------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
335
+
336
337
338
339
340
341
342
343
This works, and it works if you change the class to something other
than "recoverable"; you get the expected rethrow and panic. But
it is unreasonably ugly.  So why am I bringing it forward? Because...

2. The translation experience reduced my disposition to think that Go is
right to be narrow and prescriptive on this issue.  Two kinds of
doubts grew on me:

Anthony Fok's avatar
Anthony Fok committed
344
* *Pragmatic doubt.* Trying to be a good Go citizen, I kept looking at
345
346
places where existing nonlocal control transfers in Python could be
replaced by explicit Go-style passing upwards of an error status.  But
347
348
I noticed that there were a significant percentage of cases in which
doing this made the code more difficult to follow rather than easier.
Anthony Fok's avatar
Anthony Fok committed
349
+
350
351
352
353
A simple representative example is a call chain of several data
transformations in which each stage has its own failure condition and
any failure aborts the transformation.  If we there were no error
cases we might write, in a Pythonoid sort of notation:
Anthony Fok's avatar
Anthony Fok committed
354
355
+
[source,python]
356
----------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
357
sink = transform3(transform2(transform1(source)))
358
----------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
359
+
360
361
If a stage can error out, we might have these structural alternatives to
consider.  One is Go style:
Anthony Fok's avatar
Anthony Fok committed
362
363
+
[source,python]
364
365
366
---------------------------------------------------------------
(fail1, result1) = transform1(source)
if fail1 == true:
Anthony Fok's avatar
Anthony Fok committed
367
    status = Exception1
368
else:
Anthony Fok's avatar
Anthony Fok committed
369
370
371
372
373
374
375
376
377
378
    (fail2, result2) = transform2(result1)
    if fail2 == true:
        status = Exception2
    else:
        (fail3, result3) = transform3(result1)
        if fail3 == true:
            status = Exception3
        else:
            sink = result3
            status = OK
379
---------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
380
+
381
The other style is with a catchable exception:
Anthony Fok's avatar
Anthony Fok committed
382
383
+
[source,python]
384
385
386
387
388
389
390
---------------------------------------------------------------
status = OK
try:
    sink = transform3(transform2(transform1(source)))
except (Exception1, Exception2, Exception3) as err:
    status = err
---------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
391
+
392
393
394
395
I don't think there's even a colorable argument that the Go structure is
better in a case like this. Look at all those extra variables, that
eye-confusing ladder structure, the defect-prone near-but-not-quite
repetition of code.
Anthony Fok's avatar
Anthony Fok committed
396
+
397
398
An early reviewer pointed out that if the Go code were an entire
function it could be expressed something like this:
Anthony Fok's avatar
Anthony Fok committed
399
400
+
[source,go]
401
---------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
402
func pipeline(source T) {
403
404
	result1, err1 := transform1(source)
	if err1 != nil {
Anthony Fok's avatar
Anthony Fok committed
405
		return err
406
407
408
409
	}

	result2, err2 := transform2(result1)
	if err2 != nil {
Anthony Fok's avatar
Anthony Fok committed
410
		return err
411
412
413
414
	}

	result3, err3 := transform3(result2)
	if err3 != nil {
Anthony Fok's avatar
Anthony Fok committed
415
416
		return err
	}
417
418
419
420

	return nil
}
---------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
421
+
422
423
That's still a lot of eyeball friction compared to functional-style with
exceptions. And it gets worse faster as the number of stages rises.
Anthony Fok's avatar
Anthony Fok committed
424
+
425
My problem was that I kept finding analogous situations in my
426
427
428
429
translation.  The specific one that motivated the above pseudocode
was in a feature called "extractor classes".  There are little
bots that run the client tools of a VCS to mine the output for its
metadata.  It's actually a five- or six-stage process wherein
Anthony Fok's avatar
Anthony Fok committed
430
431
any command failure requires an abort.
+
432
In these cases moving to Go style produced a serious
433
loss of clarity.  And a rising feeling that I wanted my exceptions
434
435
back (and in fact the extractor-class code now contains the one real
instance of my exceptions kludge).  Which leads to this:
436

Anthony Fok's avatar
Anthony Fok committed
437
* *Aesthetic doubt.* I've never written a general-purpose language,
438
439
440
but I have designed way more than my share of DSLs and declarative
markups, and from this I have learned a heuristic for doing engineering
that I won't regret.  For any given capability X:
Anthony Fok's avatar
Anthony Fok committed
441
+
442
443
444
445
446
447
Being able to express X elegantly is a good place to be.  Leaving out
X entirely for safety and verifiability can be a good choice, and is
at least defensible on those grounds.  But if you implement X in a
half-hearted, weak way that requires ugly code to use and fails to
actually foreclose the conceptual problems you were trying to dodge,
that's a bad place to be.
Anthony Fok's avatar
Anthony Fok committed
448
+
449
450
That bad place is where Go is right now with respect to nonlocal
control transfers, and why I had to write my kludge.
Anthony Fok's avatar
Anthony Fok committed
451
+
452
453
Interestingly, I was also able to come up with a very minimalist
solution.  No new syntax, two minor new compilation rules.
Anthony Fok's avatar
Anthony Fok committed
454
+
455
456
To motivate it, let's set the goal of being able to rewrite my example
like this:
Anthony Fok's avatar
Anthony Fok committed
457
458
+
[source,go]
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
---------------------------------------------------------------
package main

import "fmt"

type exception struct {
	class string
	message string
}

func (e exception) Error() string {
	return e.message
}

func throw(class string, msg string, args ...interface{}) {
	e := new(exception)
	e.class = class
	e.message = fmt.Sprintf(msg, args...)
	panic(e)
}

func catch(accept string) *exception {
	if x := recover(); x == nil {
		return nil
	}
	err := x.(*exception)
	if err.class == accept {
		return err
	}
	panic(x)
}

func main() {
	defer println("Defer 1")
	defer println("Defer 2")
	defer println("Defer 3")

	defer func() {
		fmt.Println("Recover:", catch("recoverable"))
	}()
	throw("recoverable", "Don't Panic!!!")

	fmt.Println("Unreachable.")
}
---------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
504
+
505
That is rather less ugly, actually pretty reasonable if the
506
implementations of throw and catch aren't staring you in the face.
507
508
And all it would take to get there is two minor loosenings of
restrictions.
Anthony Fok's avatar
Anthony Fok committed
509
510
511
+
[arabic]
.. The panic function has a new property, "terminating". If the
512
513
514
515
516
517
compiler can prove that all exit paths from a function invoke
terminating functions, it is marked "terminating".  The effect of
this property is to suppress "missing return" errors on any code path
from call of a terminating function to exit of its caller, *but not on
other paths to exit*.

Anthony Fok's avatar
Anthony Fok committed
518
.. A recover() call is no longer required to be within the lexical
519
520
521
522
frame of a defer(). It can be in a helper called by the defer clause
(but still within the call scope of a defer). For safety we'd need
an additional rule that a go clause in the helper puts the code it
runs out of scope for purposes of this check.
523

524
525
526
527
528
529
530
531
532
=== Absence of iterators ===

Having Python iterators go missing is really annoying for reposurgeon,
in which lazy evaluation of very long lists is a frequent requirement.

Here's the type example.  I have in my repository representation a
list of possibly hundreds of thousands of events.  A subset of these
events is Commit objects.  I would like to be able to write

Anthony Fok's avatar
Anthony Fok committed
533
[source,go]
534
---------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
535
536
	for i, commit := range repo.commits() {
		do_stuff_to(commit)
537
538
539
540
541
542
	}
---------------------------------------------------------------

In Python it is easy and natural to write commits() as an iterator
which lazily walks the repository event list looking for Commit
objects. Each time it is called it either returns with "yield",
Anthony Fok's avatar
Anthony Fok committed
543
handing back the next commit, or actually returns -- which is a signal
544
545
546
547
548
549
550
551
that the for loop should terminate.

I can't do this in Go; I have to write commits() to return an entire
constructed slice made by filtering the event list.  Which is annoying
for long lists, especially when it might well terminate early.

Sure, there's an alternative.  It looks like this...

Anthony Fok's avatar
Anthony Fok committed
552
[source,go]
553
---------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
554
555
	for i, event := range self.events {
		switch event.(type) {
556
		case *Commit:
Anthony Fok's avatar
Anthony Fok committed
557
558
			do_stuff_to(event.(*Commit))
		}
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
	}
---------------------------------------------------------------

...and about which what I have to say is "Ugh!".  That code does not
say "walk through all commits", it says "walk through all events and
do something to the ones that happen to be commits".  I don't want to
wander into event-land here; that type-assertion/cast pair looks
altogether too much like a defect attractor. Also, unnecessary eyeball
friction.

I had no good idea what could be done about this.  I read Ewen
Cheslack-Postava's excellent discussion of
https://ewencp.org/blog/golang-iterators/index.html[iterator patterns
in Go] and agreed with him that none of them are really satisfactory.

Annoyingly, the iterator pattern he suggests is almost the right
Anthony Fok's avatar
Anthony Fok committed
575
thing -- except for the part where early break from a channel-based
576
577
iterator leaves its goroutine running and some uncollectible garbage.

578
Then, on my second reading, I had a brainstorm.  I found a trivial
579
580
581
582
583
584
585
586
587
588
Go extension that would give iterators with no new syntax, no hidden
magic, and no yield/return distinction:

New evaluation rule on how to interpret for loops when the range
operand is a callable: the loop runs as a generator, yielding each
value in succession, until the callable returns the zero value of its
type.

So, with that I could write a Repository method like this:

Anthony Fok's avatar
Anthony Fok committed
589
[source,go]
590
---------------------------------------------------------------
591
592
// Iterator variant A: range stops on a zero value

593
594
595
596
597
func (repo *Repository) commits() func() *Commit {
	idx := -1
	return func() *Commit {
		for {
			if idx++; idx >= len(self.events) {
Anthony Fok's avatar
Anthony Fok committed
598
				return nil
599
600
601
602
			}
			if _, ok = self.events[idx].(*Commit); ok {
				return self.events[idx]
			}
Anthony Fok's avatar
Anthony Fok committed
603
		}
604
605
606
607
608
609
610
	}
}
---------------------------------------------------------------

...and there I have it.  An iterator, with exactly the same lifetime
as the for loop.

Julien "_FrnchFrgg_" Rivaud's avatar
Julien "_FrnchFrgg_" Rivaud committed
611
Then I thought it might be best to make this properly parallel to the
612
way iteration via range works.
613

Anthony Fok's avatar
Anthony Fok committed
614
[source,go]
615
---------------------------------------------------------------
616
617
// Iterator variant B: stop variable.

618
619
620
621
622
func (repo *Repository) commits() func() *Commit {
	idx := -1
	return func() (*Commit, bool) {
		for {
			if idx++; idx >= len(self.events) {
Anthony Fok's avatar
Anthony Fok committed
623
				return nil, false
624
625
626
627
			}
			if _, ok = self.events[idx].(*Commit); ok {
				return self.events[idx], true
			}
Anthony Fok's avatar
Anthony Fok committed
628
		}
629
630
631
632
633
	}
}
---------------------------------------------------------------

With this form the iterator could pass back zero values without
634
635
terminating, terminating only when the second return value from the
function-valued range argument goes to false.
636

637
I suggest that one of these be adopted for a future release of Go. Small, easy
638
639
new evaluation rule, big gain in expressiveness.

640
641
642
643
644
645
646
647
648
=== Hieratic documentation ===

Figuring out how to do type-safe polymorphism in the event list was
more difficult than it should have been.  The problem here wasn't the
Go language, it was the official (and unofficial) documentation.

There are two problems here, one of organization and one of style.

The organization problem is that there isn't one.  The official Go
649
650
651
652
653
654
documentation seems to center on the library API docs, the
specification, the Tour, and a couple of "official" essays written for
it. It also includes a corona of white papers and blog posts.  Often
these are valuable deep dives into specific aspects of the language
even when they are notionally obsolete.  Some of them are outside the
boundaries of the official documentation site.
655
656
657
658
659
660

For example, I got substantial help understanding interfaces from an
old blog post by Ian Lance Taylor (one of the Go devs) that was
offsite, dated from 2009, and contained obsolete implementation
details.

661
The high-level problem is that while the Go devs have done a praiseworthy
662
663
664
665
and unusually effective job of documenting their language considering
the usual limitations of documentation-by-developers, finding things
in the corona is *hard*.  And knowing what's current is *hard*.

666
667
668
669
The documentation is (dis)organized in such a way that it's difficult
to know what you still don't know after reading a Tour page or blog
entry or white paper. There should be more "But see here for a
dangerous detail" links, in particular to the language specification.
Eric S. Raymond's avatar
Eric S. Raymond committed
670

671
672
673
674
675
676
Style. Go has a problem that is common to new languages with opinionated
developers (this is part of "the usual limitations" above).  There are
one or two exceptions, but the documentation is predominantly written
in a terse, hieratic style that implicitly assumes the reader already
inhabits the mindset of a Go developer.

677
678
679
680
The documentation is *not* very good at providing an entry path into
that mindset.  Not even for me, and I'm an extreme case of the sort of
person for whom it *should* do an effective job if it can do that for
anyone.
681
682
683
684
685
686
687
688

There is a fix for both problems.  It is not magic, but it is doable.

The Go dev team should bring in a documentation specialist with no
initial knowledge of Go and a directive to try to maintain an
outside-in view of the language as he or she learns.  That specialist
needs to be full-time on the following tasks:

Anthony Fok's avatar
Anthony Fok committed
689
(1) Edit for accessibility -- a less hieratic style
690
691
692
693
694
695
696
697

(2) Maintain a documentation portal that attempts to provide a
reasonable map of where everything is and how to find it.

(3) Curate links to third-party documents (for example notable Stack
Overflow postings), with dates and attached notes on what parts might
be obsolete and when the document was last reviewed for correctness.

Eric S. Raymond's avatar
Eric S. Raymond committed
698
699
(4) Bring the very best third-party stuff inside, onto https://golang.org/doc/.

700
701
Note: After writing this, I had an even worse time digging up and
fixing in my mind all the details of how defer/panic/recover works.
Eric S. Raymond's avatar
Eric S. Raymond committed
702
It's almost all documented somewhere, though Peter Seebach and I ended
703
up writing a FAQ entry on how to set local variables from a defer clause to
704
clear up minor confusion. There's a very helpful blog
Eric S. Raymond's avatar
Eric S. Raymond committed
705
post on the general topic.  But the blog post leaves out the crucial detail
706
707
708
709
710
711
712
713
that recover returns interface {}, not error; this tripped me up when
I was writing my kludge, and I ended up on IRC getting referred to the
formal Go specification.

This is all too typical. Everything makes sense once you know it, but
before you know it critical details are often lurking in places you
have no way of knowing you should look.

714
Attention to the problem and a good technical writer/editor can fix this.
715

716
717
718
719
720
721
722
723
724
725
726
727
== Outcomes ==

My performance objectives were achieved. I didn't get a fully 40x
speedup, but only because the running time of the GCC conversion is
dominated by the low speed of the Subversion tools.  The non-I/O
limited part of processing fell from about 7 hours to about 20 minutes
(about 20x), and the overall speedup over Python was about 10x.

Additionally, maximum working set drastically decreased. These
improvements re-enabled the workflow reposurgeon was designed for,
rapid iterative improvement of conversion recipes.

Eric S. Raymond's avatar
Eric S. Raymond committed
728
On January 12th 2020 the production conversion of the GCC history from
729
730
731
732
Subversion to Git actually took place.

I am no longer a Go novice. :-)

733
734
== Accentuating the positive ==

Anthony Fok's avatar
Anthony Fok committed
735
The Go translation of reposurgeon is better -- more maintainable -- code
Eric S. Raymond's avatar
Eric S. Raymond committed
736
than the Python original, not just faster.  And this is *not* because I
737
738
rewrote or refactored as I went; as I've explained, I tried very hard
to avoid that. It's that Go's minimalistic approach actually...works.
739
740
741
742
743
744

I see a maintainability benefit from the static typing. The Go type
system does what a type system is supposed to do, which is express
program invariants and assist understanding of its operational
semantics.

745
746
747
748
749
750
751
752
753
754
755
Translation from Python, which is a dreadful language to try to do
concurrency in due to its global interpreter lock, really brought home
to me how unobtrusively brilliant the Go implementation of
Communicating Sequential Processes is. The primitives are right and
the integration with the rest of the language is wonderfully
seamless. The Go port of reposurgeon got some very large speedups at
an incremental-complexity cost that I found to be astonishingly low.
I am impressed both by the power of the CSP part of the design and the
extreme simplicity and non-fussiness of the interface it presents. I
hope it will become a model for how concurrency is managed in future
languages.
756

757
758
I've also seen a maintainability benefit from how easy Go makes it to
write unit tests in parallel with code.
759

760
761
762
763
The Go profiling tools (especially the visualization parts) are
extremely effective, much better at smoking out hidden
superlinearities in algorithms than Python's.

764
765
766
I have to call out the Go time library as a particularly good piece of
work. Having the basic timestamp property be location-aware with its
presentation modified by the implied zone offset simplified a lot
767
of cruft out of the handling of committer/author dates in Python.
768

769
770
771
772
Now that I've seen Go strings...holy hell, Python 3 unicode strings
sure look like a nasty botch in retrospect. Good work not falling into
that trap.

773
== Translation pragmatics ==
Eric S. Raymond's avatar
Eric S. Raymond committed
774
775
776
777
778
779

The strategy of writing as literal as possible a translation of the
Python first, at the cost of generating Go code that was initially
clunky and unidiomatic, worked quite well.  It actually took effort
and discipline to refrain from trying to improve the code as it passed
through translation, but I am very glad I expended that effort.  It
Eric S. Raymond's avatar
Eric S. Raymond committed
780
kept the translation process sane and controllable.
Eric S. Raymond's avatar
Eric S. Raymond committed
781
782
783
784
785
786
787
788
789
790
791
792
793
794

I should note that a prerequisite for a translation like this is an
excellent test suite.  As I write (in late January 2020), reposurgeon
at 22KLOC, has 52 unit tests, 177 round-trip tests, 218 end-to-end
functional tests, and a miscellany of special tests that add up to a
total of 502 reporting items.  Only around 20 of these were added
during the translation itself.  All these tests are run by continuous
integration on every commit.

That translation phase was completed In November 2019 when the Go code
first passed the full test suite. The two months since has been
sufficient to polish the literalistic mock-Python it was at that time
into what is now, I believe, pretty clean idiomatic Go.

795
796
797
798
799
800
801
== Pass-by-reference vs. pass-by-value ==

I think I can say now that once one has a translation from Python to
Go that compiles, the largest single sources of bugs is the difference
between Python pass-by-reference semantics for object and Go
pass-by-value.  Especially when iterating over lists.

Anthony Fok's avatar
Anthony Fok committed
802
Go "for i, node := range nodelist" looks very similar to Python
803
"for (i, node) in enumerate nodelist"; the gotcha is that Go's
Anthony Fok's avatar
Anthony Fok committed
804
pass-by-value semantics means that altering members of node
805
806
807
808
will *not* mutate the nodelist.

The fix isn't very difficult; this

Anthony Fok's avatar
Anthony Fok committed
809
[source,go]
810
-----------------------------------------------------------------
Anthony Fok's avatar
Anthony Fok committed
811
812
	for i := range nodelist {
		node := &nodelist[i]
813
814
815
816
		...
	}
-----------------------------------------------------------------

Anthony Fok's avatar
Anthony Fok committed
817
often suffices.
818
819

I don't have any recommended language change around this, as I don't
Anthony Fok's avatar
Anthony Fok committed
820
think Go's choice is wrong. I do think the fact that this relatively
821
822
minor issue is one of the larger translation barriers is interesting.

823
824
825
== Envoi: Actionable recommendations ==

These are in what I consider rough priority order.
826
827
828

1. Keyword arguments should be added to the language.

829
2. True iterators are felt by their absence and would be easy to add.
830

831
832
3. Any enlargement in the range of what can be declared const
   would be good for safety and expressiveness.
833

834
4. Yes, throw()/catch() needs to be writeable in the language.  Two
835
836
   minimal relaxations of compilation rules would make writing it
   possible.
837

838
839
840
841
5. A technical writer with an outside-in view of the language should
   be hired on to do an edit pass and reorganization of the documents.

6. Lookbehinds should be added to the regexp library.
842

843
7. If generics don't fly, a map-over-slice intrinsic should be added.
Eric S. Raymond's avatar
Eric S. Raymond committed
844
845
846

Not quite actionable yet:

847
848
* Absence of sum types creates an actual hole in the type-safety of
  the language.