Output to text streams other than real disk files goes unbuffered
Original Reporter info from Mantis: croco
Reporter name: Andrey "Croco" Stolyarov
Original Reporter info from Mantis: croco
- Reporter name: Andrey "Croco" Stolyarov
Output to text streams, including stdout, is done without buffering, unless it is redirected to a real disk file. This significantly raises the program's execution time.
Steps to reproduce:
To reliably reproduce the effect, one can actually use any program that
does text output (including the output to stdout) of a notable volume. To
make a simple test, a program that outputs numbers from 0 to 100000000 (100
millions), one per line, will do:
program million; var i: longint; begin for i:=0 to 100000000 do writeln(i) end.
100 mils is choosen because this takes measurable amount of time (tens of
seconds), but not too much in order not to fall asleep.
With the version of RTL present in current repositories, the following
measurements are observed. When the output goes to a plain disk file:
crocodil@frock:~/FPC$ time ./mil > file
So it took less than 29 seconds of real (wall clock) time, of which the
kernel had to spend 9.5 seconds. If the same thing is done using a
pipeline, the figures become different (please note the time command is
only applied to ./mil, not to the whole pipeline):
crocodil@frock:~/FPC$ time ./mil | cat > file
The "wall clock" time actually raised less than 2.5 times, but please note
the system time which is more that the "real". This means the poor kernel
had to utilize more than one CPU to serve this process and, roughly
speaking, spent 13 times more time.
This is because for the first run, the output (on the level of syscalls) is
done like this (the following are excerpts from what strace command shows):
write(1, "0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n"..., 256) = 256
write(1, "\n89\n90\n91\n92\n93\n94\n95\n96\n97\n98\n9"..., 256) = 256
write(1, "5\n156\n157\n158\n159\n160\n161\n162\n16"..., 256) = 256
which seems reasonable (altough 256 is not the best buffer size -- well,
4096 is used by most of existing software; but 256 is still acceptable).
But in case the output goes anywhere but a real disk file, the output is
done like this:
write(1, "0\n", 2) = 2
write(1, "1\n", 2) = 2
write(1, "2\n", 2) = 2
write(1, "160\n", 4) = 4
write(1, "161\n", 4) = 4
write(1, "162\n", 4) = 4
write(1, "87337\n", 6) = 6
write(1, "87338\n", 6) = 6
write(1, "87339\n", 6) = 6
This means every writeln within the program causes its own system call
write(2), that is, there's no output buffering in effect. The funniest
thing here is that in case the output is redirected to /dev/null, it works
slower that for a disk file:
crocodil@frock:~/FPC$ time ./mil > /dev/null
This is ONLY because of total number of syscalls; every single syscall in
this case returns almost immediately, because there's nothing to do for it;
but the costs of switching contexts between the process and the kernel are
still in effect.
This all was not the case for the past releases, e.g., if we compile the
same program with fpc 3.2.0, we've got the following figures:
crocodil@frock:~/FPC$ time ./mil320 > file
crocodil@frock:~/FPC$ time ./mil320 | cat > file
crocodil@frock:~/FPC$ time ./mil320 > /dev/null
Please specifically note less than one second system time in the last case.
It looks like that this misbehaviour was introduced during the work on
ticket https://bugs.freepascal.org/view.php?id=37716 when the function
Do_IsDevice got changed despite that it had no relationship to the bug
reported in that ticket, and despite that I strolngly suggested not to do
this particular change.
The function Do_IsDevice was not broken, it just was (and still is)
misnamed. It was relatively safe not to touch it at all (as from SeekEof
it was called in error, and now in SeekEof there's no call to Do_IsDevice).
The "fix" for non-broken function lead to the obvious misbehaviour, as only
the streams that go to a human user may go unbuffered (and even these,
actually, should be buffered, but in a manner different from other
streams). Pipes, sockets and devices (like, e.g., /dev/null) are NOT of
this kind, only TTYs are; it is common practice in Unix world to pass
gigabytes of data through pipes and/or through standard input/output
streams, so unbuffering them all is a catastrophe, which can not even be
worked around anyhow from within the pascal program. Effectively this
makes Free Pascal totally unusable for implementing a large kind of
programs known as "unix filters".
Please note that before the "fix", Do_IsDevice was implemented as isatty(3)
-- that is, it issued ioctl(2) syscall to determine whether the stream is
connected to TTY or not.
It is very easy to fix this misbehaviour: roll back the Do_IsDevice
implementation to its previous state, and you're done.
What I also strongly suggest is to rename Do_IsDevice into Do_IsConsole or
the like, so that it will not do more confusion than it already did.