Zig: Odd bug in std.mem.concat only on Windows.

Created on 21 Oct 2020 · 13 comments · Source: ziglang/zig

I found this while trying to use ZPM to add a package. As it turns out, it is not a ZPM bug, but rather a bug in Zig's std.mem.concat.

Below is a simplified reproduction of the error:

const std = @import("std");

pub fn main() !void {
    const name = try std.io.getStdIn().reader().readUntilDelimiterAlloc(std.heap.page_allocator, '\n', 512);
    defer std.heap.page_allocator.free(name);
    std.debug.warn("\n", .{});

    var path = try std.mem.concat(std.heap.page_allocator, u8, &[_][]const u8{
        "packages/",
        name,
        ".json",
    });

    std.debug.warn("{}\n", .{path});
}

Input: myPackage
Output on Ubuntu 20.04: packages/myPackage.json
Output on Windows: .jsonges/myPackage

This suggests that the ".json" is overwriting the beginning of the buffer. What's weirder: when iterating through the entire buffer during the copy, every character printed forms the correct string packages/myPackage.json, yet the returned result is the corrupted version.

standard library


All 13 comments

fn myConcat(allocator: *std.mem.Allocator, comptime T: type, slices: []const []const T) ![]T {
    if (slices.len == 0) return &[0]T{};

    var total_len: usize = 0;
    for (slices) |slice| {
        total_len += slice.len;
    }

    const buf = try allocator.alloc(T, total_len);
    errdefer allocator.free(buf);

    var buf_index: usize = 0;

    var slice_idx: usize = 0;
    while (slice_idx < slices.len) : (slice_idx += 1) {
        var i: usize = 0;
        while (i < slices[slice_idx].len) : (i += 1) {
            buf[buf_index] = slices[slice_idx][i];
            buf_index += 1;
            std.debug.print("{c}\n", .{buf[buf_index - 1]});
        }
    }

    return buf;
}

This was my attempt to track the bug down. The debug print writes every character correctly, and buf_index is obviously never reset to the beginning, so this corruption shouldn't be able to occur during the copy.

Odd bug

It's not a bug, it's a Windows feature :) readUntilDelimiterAlloc reads until a \n is found... but Windows uses \r\n for its newlines! When you type in myPackage, what you really get is myPackage\r, and when the full path is built (P.S.: use fs.path.joinSep) you get packages/myPackage\r.json.

Whenever you print that string to a terminal, the \r moves the cursor back to the beginning of the line; that's why the .json part appears misplaced.
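Until a line-aware helper exists, a minimal workaround sketch (written against a recent Zig std, where the allocator is passed by value rather than as *std.mem.Allocator like in the snippets above): strip the stray CR with std.mem.trimRight before concatenating.

```zig
const std = @import("std");

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // Simulate what readUntilDelimiterAlloc('\n') hands back on Windows:
    // the line still carries the '\r' from the CRLF pair.
    const name: []const u8 = "myPackage\r";

    // Stripping the trailing CR before building the path fixes the output.
    const trimmed = std.mem.trimRight(u8, name, "\r");

    const path = try std.mem.concat(allocator, u8, &[_][]const u8{
        "packages/", trimmed, ".json",
    });
    defer allocator.free(path);

    std.debug.print("{s}\n", .{path});
}
```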

That explains it, but it also makes it more confusing. Perhaps we could add a new method like readUntilNewLineAlloc, or even more simply readLineAlloc, that handles this on a per-platform basis?

This is why Zig needs a {q} format specifier. 😁

Perhaps we could create a new method like readUntilNewLineAlloc or even more simply readLineAlloc that handled it on a per-platform basis for simplicity?

I like the idea of a readLineAlloc (what about readLineArrayList? and readLineOrEof?), but it should either recognize and chop away both CRLF and LF sequences, or take an explicit parameter to control this behaviour; you can have files with LF endings on Windows and files with CRLF endings on Linux.

I think detection should be the main feature since you could make it explicit with the existing untilDelimiter call

I think detection should be the main feature since you could make it explicit with the existing untilDelimiter call

That's why I suggested it should remove both CRLF and LF endings, this way you get a simple method that reads a platform-independent line.

And readUntilDelimiterAlloc accepts a single byte as delimiter, reading CRLF requires you to read the line and cut off the trailing CR.
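A sketch of what such a detecting helper could look like. readLineAlloc here is hypothetical, not an existing std function, and it is written against a recent Zig std (value allocator): read until LF with the existing call, then drop a single trailing CR.

```zig
const std = @import("std");

// Hypothetical helper, not in std: reads one line, accepting both
// LF and CRLF terminators, and returns it without the terminator.
fn readLineAlloc(allocator: std.mem.Allocator, reader: anytype, max_size: usize) ![]u8 {
    const raw = try reader.readUntilDelimiterAlloc(allocator, '\n', max_size);
    errdefer allocator.free(raw);
    if (raw.len > 0 and raw[raw.len - 1] == '\r') {
        // Re-allocate without the trailing CR so the caller can free
        // the returned slice directly.
        const line = try allocator.dupe(u8, raw[0 .. raw.len - 1]);
        allocator.free(raw);
        return line;
    }
    return raw;
}
```

With detection, files with LF endings on Windows and CRLF endings on Linux both come out the same, which is the behaviour argued for above.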

There's already a comptime delimiter constant in the standard library that could be used to drive a readLineAlloc or whatever

40+ years of experience with handling text files cross-platform speaking here: "On all platforms, the standard read-a-line functions should always consider CR, LF, and CRLF to be line terminators (and yes, that means that if you see a CR, also eat a single LF if one appears immediately after it). Also, keep in mind that there might not be any of these characters at the end of the last line in a text file." Doing it this way means that all programs will automatically handle text files that have been transferred from disparate OSes by disparate means.

Now, get off my lawn.
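The policy above can be sketched as a small splitter; nextLine is an illustrative name, not part of std. CR, LF, and CRLF each end a line (a CR eats one immediately following LF), and a final line with no terminator at all is still returned.

```zig
const std = @import("std");

// Returns the next line of `buf` starting at `pos.*`, advancing `pos`
// past the line and its terminator; null once the buffer is exhausted.
fn nextLine(buf: []const u8, pos: *usize) ?[]const u8 {
    if (pos.* >= buf.len) return null;
    const start = pos.*;
    var i = start;
    while (i < buf.len and buf[i] != '\r' and buf[i] != '\n') : (i += 1) {}
    const line = buf[start..i];
    if (i < buf.len) {
        // If we stopped at a CR, also eat a single LF right after it.
        if (buf[i] == '\r' and i + 1 < buf.len and buf[i + 1] == '\n') i += 1;
        i += 1;
    }
    pos.* = i;
    return line;
}
```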

should always consider CR, LF, and CRLF to be line terminators

That was before the Unicode era. :)
From https://en.wikipedia.org/wiki/Newline#Unicode:

The Unicode standard defines a number of characters that conforming applications should recognize as line terminators:
LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029

Indeed! Note that the Unicode list of line terminators is a strict superset of mine, including the CR/LF/CRLF complication. And, hey, even back in the day we had FF (which was seldom seen, and typically caused a "clear screen" operation when displayed, and an "eject page" when printed), and even VT (even more infrequent in the wild, and nobody knew what it was really supposed to do, particularly as differentiated from LF). But NEL, LS, PS? You kids these days, with your fancy emo-gees.

But NEL, LS, PS? You kids these days, with your fancy emo-gees.

E.g., D supports them.
