Zig: send patch to Linux to fix std.cache_hash isProblematicTimestamp false negative

Created on 18 Aug 2020  路  7Comments  路  Source: ziglang/zig

In a916f63940ed0cf039ab5f70fbee022f604995ea I fixed an obvious bug which uncovered a nefarious latent bug:

220/1502 cache_hash.test "std-native-Debug-c-multi check that changing a file makes cache fail"...expected null, found GLuqBm2tKRkCtIQkzLj14o2jcZqFlknF
/drone/src/lib/std/testing.zig:137:36: 0x4ec287 in testing.expectEqual (test)
                    std.debug.panic("expected null, found {}", .{actual_payload});
                                   ^
/drone/src/lib/std/cache_hash.zig:596:28: 0x3c963f in cache_hash.test "std-native-Debug-c-multi check that changing a file makes cache fail" (test)
        testing.expectEqual(@as(?[32]u8, null), try ch.hit());
                           ^
/drone/src/lib/std/special/test_runner.zig:56:28: 0x44189f in std.special.main (test)
        } else test_fn.func();
                           ^
/drone/src/lib/std/start.zig:252:37: 0x4432a7 in start.main (test)
            const result = root.main() catch |err| {
                                    ^
/drone/src/lib/libc/musl/src/env/__libc_start_main.c:94:7: 0xdfd90f in libc_start_main_stage2 (/drone/src/lib/libc/musl/src/env/__libc_start_main.c)
    exit(main(argc, argv, envp));
      ^

Tests failed. Use the following command to reproduce the failure:
/drone/src/zig-cache/o/7O-hvqSImgvR7hUjZnC1Q3KhHiiX6mD6rKD9pyn_tjYSpznCIZWAaHyAEa6I31Sh/test
The following command exited with error code 255:
/drone/src/build/zig test /drone/src/lib/std/std.zig --library c --test-name-prefix std-native-Debug-c-multi  --cache-dir /drone/src/zig-cache --name test -I /drone/src/test --override-lib-dir /drone/src/lib 

I tried to fix it with ce8b9c0c5cdbe4161952e6f2aa875f722949d4cb but it's still happening sometimes.

I think it's time to step back and figure out how this can work properly. As it stands, isProblematicTimestamp is not reliable. Empirically, I tried this code:

const std = @import("std");
const fs = std.fs;

pub fn main() !void {
    const cwd = fs.cwd();
    const temp_file = "hello.txt";
    var timer = try std.time.Timer.start();
    var count: usize = 0;
    try cwd.writeFile(temp_file, temp_file);
    var mtime = try getmtime(temp_file);
    while (true) {
        _ = timer.lap();
        try cwd.writeFile(temp_file, temp_file);
        const new_mtime = try getmtime(temp_file);
        if (new_mtime != mtime) {
            mtime = new_mtime;
            break;
        }
    }
    while (true) {
        try cwd.writeFile(temp_file, temp_file);
        count += 1;
        const new_mtime = try getmtime(temp_file);
        if (new_mtime != mtime) {
            break;
        }
    }
    const ns = timer.lap();
    std.debug.print("count={} ns={}\n", .{ count, ns });
}

fn getmtime(temp_file: []const u8) !i128 {
    const file = try fs.cwd().openFile(temp_file, .{});
    defer file.close();
    const stat = try file.stat();
    return stat.mtime;
}

Got output like this:

[nix-shell:~/Downloads/zig/build]$ ./zig build-exe test.zig --release-fast

[nix-shell:~/Downloads/zig/build]$ ./test
count=12 ns=1768757

[nix-shell:~/Downloads/zig/build]$ ./test
count=12 ns=1010152

[nix-shell:~/Downloads/zig/build]$ ./test
count=13 ns=1874677

[nix-shell:~/Downloads/zig/build]$ ./test
count=2 ns=1005522

[nix-shell:~/Downloads/zig/build]$ ./test
count=13 ns=1834186

[nix-shell:~/Downloads/zig/build]$ ./test
count=13 ns=1071032

[nix-shell:~/Downloads/zig/build]$ ./test
count=2 ns=1013771

The meaning of this is that over the span of about 1 millisecond, I was able to write and then read the same file 13 times without the mtime changing, despite the file system purporting to have nanosecond granularity. Womp womp.

bug contributor friendly os-linux stage2 standard library upstream

Most helpful comment

Thanks for this deep dive, @heidezomp! The 10ms find is a big deal, and I think it will be worth adding a linux-specific adjustment to isProblematicTimestamp to take into account this finding. I think I will still leave this issue open as a passive aggressive way to complain about Linux's API here

All 7 comments

The best reference on mtime weirdness I know is the blog post mtime comparison considered harmful by apenwarr. The author's conclusion is that mtimes alone cannot be used reliably to track whether a file has been changed, and that a build system should not only track mtimes but also file size, inode, mode etc., and rebuild if any of them change. Regardless of whether you agree with this conclusion, the article contains some useful information on mtime shenanigans across different operating systems and filesystems.

That article was the main inspiration for Zig's cache system :-)

This issue is left open for someone to send a patch to Linux to better support this use case.

What I've been able to gather about Linux and mtimes so far:

I first tried to run the program @andrewrk provided. For me this yields the following output:

bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-andrew.zig
count=13 ns=4300889
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-andrew.zig
count=16 ns=4291541
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-andrew.zig
count=17 ns=4308449
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-andrew.zig
count=16 ns=4309373

So for me the span of time over which the mtime remains the same is roughly 4.3ms, instead of Andrew's 1ms. Even worse!

I then modified the program to print the actual mtimes of two subsequent mtime changes, and their difference:

const std = @import("std");
const fs = std.fs;

pub fn main() !void {
    const cwd = fs.cwd();
    const temp_file = "hello.txt";
    try cwd.writeFile(temp_file, temp_file);
    const mtime = try getmtime(temp_file);
    const new_mtime = while (true) {
        try cwd.writeFile(temp_file, temp_file);
        const new_mtime = try getmtime(temp_file);
        if (new_mtime != mtime) {
            break new_mtime;
        }
    } else unreachable;
    std.debug.print(
        "old mtime: {}\nnew mtime: {}\ndifference: {}\n",
        .{ mtime, new_mtime, new_mtime - mtime },
    );
}

fn getmtime(temp_file: []const u8) !i128 {
    const file = try fs.cwd().openFile(temp_file, .{});
    defer file.close();
    const stat = try file.stat();
    return stat.mtime;
}

This gave me the following output:

bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-print-mtime.zig
old mtime: 1597837894933562384
new mtime: 1597837894937562443
difference: 4000059
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-print-mtime.zig
old mtime: 1597837895985577985
new mtime: 1597837895989578044
difference: 4000059
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-print-mtime.zig
old mtime: 1597837896781589786
new mtime: 1597837896785589845
difference: 4000059
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-print-mtime.zig
old mtime: 1597837897505600517
new mtime: 1597837897509600577
difference: 4000060
bram@acerola:~/Code/tmp$ irb
irb(main):001:0> 1000000000.0/4000060
=> 249.99625005624915

Almost exactly 4ms difference each time, but the difference unfortunately is not exact (nor the same each time, as the last run demonstrates), so we cannot assume that the difference between problematic mtimes will be some exact number. As you can see this value corresponds to 250Hz, which turns out to be a Linux kernel configuration parameter:

bram@acerola:~/Code/tmp$ grep 'CONFIG_HZ=' /boot/config-$(uname -r)
CONFIG_HZ=250

So my CONFIG_HZ is set to 250, and from the output that Andrew provided, it seems like on his system it is set to 1000. Unfortunately there does not seem to be a way to get this value on a running system, apart from extracting it from the kernel config, and kernels can apparently also be compiled to be "tickless", in which case all of this does not apply.
More info on CONFIG_HZ is provided by this excerpt from man 7 time:

The software clock, HZ, and jiffies
The accuracy of various system calls that set timeouts, (e.g., select(2), sigtimedwait(2)) and measure CPU time (e.g., getrusage(2)) is limited by the resolution of the software clock, a clock maintained by the kernel which measures time in jiffies. The size of a jiffy is determined by the value of the kernel constant HZ. The value of HZ varies across kernel versions and hardware platforms. On i386 the situation is as follows: on kernels up to and including 2.4.x, HZ was 100, giving a jiffy value of 0.01 seconds; starting with 2.6.0, HZ was raised to 1000, giving a jiffy of 0.001 seconds.
Since kernel 2.6.13, the HZ value is a kernel configuration parameter and can be 100, 250 (the default) or 1000, yielding a jiffies value of, respectively, 0.01, 0.004, or 0.001 seconds. Since kernel 2.6.20, a further frequency is available: 300, a number that divides evenly for the common video frame rates (PAL, 25 HZ; NTSC, 30 HZ).

So in conclusion, the problems here are that

  1. Linux does not set an mtime rounded with zeroes as other systems apparently do;
  2. there is no way to get the correct mtime granularity (CONFIG_HZ) at runtime short of grepping through the kernel config;
  3. the granularity of the mtime (if there even is one) can differ depending on kernel configurations, so it cannot be hardcoded. At worst it will be 100Hz, i.e. 10ms.

To me, the most robust solution (yielding the least false positives) would seem to be to always round the mtime on Linux to multiples of 10ms for the purposes of isProblematicTimestamp. I don't know whether this is a desired solution.

Sources:

Bonus content: Note that it _is_ possible to actually write mtimes with nanosecond precision using the utimensat syscall, as demonstrated by the following program which writes a file, reads the mtime, adds one nanosecond to the mtime using utimensat and then reads the mtime again:

const std = @import("std");
const fs = std.fs;

usingnamespace @cImport({
    @cDefine("UTIME_OMIT", "((1l << 30) - 2l)"); // TODO add this to Zig std lib
});

pub fn main() !void {
    const cwd = fs.cwd();
    const temp_file = "hello.txt";
    try cwd.writeFile(temp_file, temp_file);
    const mtime = try getmtime(temp_file);
    if (std.os.linux.utimensat(cwd.fd, temp_file, &[2]std.os.linux.timespec{
        .{ .tv_sec = undefined, .tv_nsec = UTIME_OMIT }, // atime
        .{
            .tv_sec = @intCast(isize, @divTrunc(mtime, std.time.ns_per_s)),
            .tv_nsec = @intCast(isize, @rem(mtime, std.time.ns_per_s) + 1),
        }, // mtime
    }, 0) != 0) return error.Unexpected;
    const new_mtime = try getmtime(temp_file);
    std.debug.print(
        "old mtime: {}\nnew mtime: {}\ndifference: {}\n",
        .{ mtime, new_mtime, new_mtime - mtime },
    );
}

fn getmtime(temp_file: []const u8) !i128 {
    const file = try fs.cwd().openFile(temp_file, .{});
    defer file.close();
    const stat = try file.stat();
    return stat.mtime;
}

Output:

bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-utimensat.zig
old mtime: 1597840942975805318
new mtime: 1597840942975805319
difference: 1
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-utimensat.zig
old mtime: 1597840943719815655
new mtime: 1597840943719815656
difference: 1
bram@acerola:~/Code/tmp$ zig run -Drelease-fast zig-6082-utimensat.zig
old mtime: 1597840944387824943
new mtime: 1597840944387824944
difference: 1

I have no idea why Linux does not do this itself. Maybe querying a high-resolution timer to write an exact mtime is considered too slow for normal usage?

From Wikipedia:
Filesystem | Date resolution|
-----------|----------------|
Btrfs |1 ns|
Ext3 |1 s |
Ext4 |1 ns|
NTFS |100 ns|

Thanks for this deep dive, @heidezomp! The 10ms find is a big deal, and I think it will be worth adding a linux-specific adjustment to isProblematicTimestamp to take into account this finding. I think I will still leave this issue open as a passive aggressive way to complain about Linux's API here

Was this page helpful?
0 / 5 - 0 ratings