Rust: Reading a CString safely without overhead from Read

Created on 16 Mar 2019  路  11Comments  路  Source: rust-lang/rust

Hello everyone,

When I was inspecting libflate for unsafe code I found this piece of code:

fn read_cstring<R>(mut reader: R) -> io::Result<CString>
where
    R: io::Read,
{
    let mut buf = Vec::new();
    loop {
        let b = reader.read_u8()?;
        if b == 0 {
            return Ok(unsafe { CString::from_vec_unchecked(buf) });
        }
        buf.push(b);
    }
}

The Problem

If I am correct and didn't miss anything this function is completely safe. However because there is (for as far as I know) no functionality that can safely read a CString without performance overhead, the author probably felt forced to implement it himself.

I think reading CStrings from an object that implements Read is a pretty common operation when handling binary files, so it might be good to provide functionality in the standard library for doing so without sacrificing performance.

Currently two solutions have been presented.

Solution 1: Add a CString::from_reader method

One solution would be to add a CString::from_reader method as follows:

fn CString::from_reader(mut reader: impl Read) -> Result<CString, std::io::Error>
{
    let mut buffer = Vec::new();
    let mut character: u8 = 0;

    loop {
        let slice = std::slice::from_mut(&mut character);
        reader.read_exact(slice)?;

        // Check if a null character has been found, if so return the Vec as CString.
        if character == 0 {
            return Ok(unsafe { CString::from_vec_unchecked(buffer) });
        }

        // Push a new non-null character.
        buffer.push(character);
    }
}

Pro's and Con's:
+ Pro: CStrings can be read using a simple one liner if the source being read implements Read. &[u8] also implements Read so I think this is already quite flexible.
- Con: It only works when reading from objects that impl Read. I am not sure if there are any scenario's on which this would not be sufficient?

-> Playground example

Solution 2: Add way to convert from Vec<NonZeroU8> to CString

Credits to @alex

Another solution would be to add a conversion method (by for example using the From trait) for Vec<NonZeroU8> to CString. Since we can be sure that no zero characters are included in the Vector we could perform a cheap conversion. I took an attempt to implementing From for CString but it might be improved.

impl From<Vec<NonZeroU8>> for CString
{
    fn from(vector: Vec<NonZeroU8>) -> Self
    {
        unsafe {
            let vector = std::mem::transmute::<Vec<NonZeroU8>, Vec<u8>>(vector);
            CString::from_vec_unchecked(vector)
        }
    }
}

Pro's and Con's:
+ Pro: Everything that can be converted to a Vec<NonZeroU8> can be safely converted into a CString.
- Con: People would still have to read bytes from a Read````-able object manually and converting them into aVec``` in order for this to work.

-> Playground example

I wonder what you all think of this proposal and whether or not it could be improved.

A-ffi C-feature-request T-libs

Most helpful comment

@Shnatsel I completely agree! I think in most cases (depending on how Read::read is implemented) an implementation for BufRead is probably much faster.

I do however wonder: Wouldn't it be better to do it like this?

fn from_reader(mut reader: impl BufRead) -> Result<CString, std::io::Error>
{
    let mut buffer = Vec::new();
    reader.read_until(0, &mut buffer)?;
    if buffer.len() > 0 {
        return Ok(CString { inner: buffer.into_boxed_slice() });
    } else {
        return Ok(unsafe { CString::from_vec_unchecked(buffer) });
    }
}

This way we can avoid doing a buffer.pop and pushing a null character again in Cstring::from_vec_unchecked. But maybe the compiler would be able to optimize your version such that the performance would be the same.

All 11 comments

The second option seems very unlikely to be helpful; I can't think of any usecases for having a NonZeroU8 Vec beyond this API. Plus, converting to it would mean essentially implementing this same code - and then you could just use the from_vec_unchecked method.

I think the first proposal is quite reasonable though.

I think allowing to safely convert Vec<NonZeroU8> to CString is a good idea regardless. It's an obviously correct conversion and it would be a shame not to provide a safe interface for it.
I could see it being used for performance reasons in security-critical code where the authors would like to avoid rolling their own unsafe, since part of the appeal of NonZeroU8 is that you can make the check once and avoid any further checks down the road.

I wonder why read_until() is only implemented on BufRead, but not Read? There is probably a good reason for that - and if so, we should also implement CString::from_reader() only on BufRead. This should still serve most use cases because Cursor implements BufRead, and external I/O without buffering is unbearably slow so you'd wrap it in buffering anyway.

Here's what an implementation for BufRead could look like:

fn from_reader(mut reader: impl BufRead) -> Result<CString, std::io::Error>
{
    let mut buffer = Vec::new();
    reader.read_until(0, &mut buffer)?;
    if buffer.len() > 0 {
        // 0 has been read into the vector, pop it
        buffer.pop(); //TODO: consider in-place transmute
    } else {
        // no bytes have been read, nothing to do
    }
    return Ok(unsafe { CString::from_vec_unchecked(buffer) });
}

@Shnatsel I completely agree! I think in most cases (depending on how Read::read is implemented) an implementation for BufRead is probably much faster.

I do however wonder: Wouldn't it be better to do it like this?

fn from_reader(mut reader: impl BufRead) -> Result<CString, std::io::Error>
{
    let mut buffer = Vec::new();
    reader.read_until(0, &mut buffer)?;
    if buffer.len() > 0 {
        return Ok(CString { inner: buffer.into_boxed_slice() });
    } else {
        return Ok(unsafe { CString::from_vec_unchecked(buffer) });
    }
}

This way we can avoid doing a buffer.pop and pushing a null character again in Cstring::from_vec_unchecked. But maybe the compiler would be able to optimize your version such that the performance would be the same.

The implement PR is here: https://github.com/rust-lang/rust/pull/59314

I could create one for BufRead as well however. Should I call it from_bufread?

I had a discussion on the PR #59314 with @dtolnay and I decided to close the PR since it does not feel elegant enough yet in my opinion. I feel like there must be some smooth trait based solution. I hope to think of something better.

Using read_until is nice, although the suggested implementation is not sound, in @DevQps case, and may crop a non-null byte in @Shnatsel case; indeed, read_until may read contents without reaching the expected byte (when EOF is reached).

Here is a fixed (and performant) implementation:

/// # Safety
///
///   - `BufRead` implementor mut enforce its contract
unsafe
fn from_reader_unchecked (
    mut reader: impl BufRead,
) -> Result<CString, ::std::io::Error>
{
    let mut buffer = Vec::new();
    reader.read_until(b'\0', &mut buffer)?;
    if let Some(&b'\0') = buffer.last() {
        // # Safety
        //
        //   - no null bytes before `buffer.last()` (thanks to `.read_until()` contract)
        //
        //   - last byte has already been checked to be null
        CString::from_vec_unchecked(buffer)
    } else {
        buffer.reserve_exact(1);
        buffer.push(b'\0');
        // # Safety
        //
        //   - no null bytes before `buffer.last()` (thanks to `.read_until()` contract)
        //
        //   - terminating null byte has been appended
        CString::from_vec_unchecked(buffer)
    }
}

/// # Safety:
///
///   - `read_until` must enforce its contract.
unsafe trait TrustedBufRead : BufRead {}

#[inline]
fn from_reader (
    reader: impl TrustedBufRead,
) -> Result<CString, ::std::io::Error>
{
    unsafe { from_reader_unchecked(reader) }
}

@danielhenrymantilla @Shnatsel's implementation was correct to pop the null-byte, as CString::from_vec_unchecked will add it itself. The unsafety that from_vec_unchecked requires only that the Vec contains no null bytes at all, neither interior nor the final one. So it should be instead:

unsafe
fn from_reader_unchecked (
    mut reader: impl BufRead,
) -> Result<CString, ::std::io::Error>
{
    let mut buffer = Vec::new();
    reader.read_until(b'\0', &mut buffer)?;
    if let Some(&b'\0') = buffer.last() {
        buffer.pop()
        // # Safety
        // last was the first null byte encountered, it's been removed.
        CString::from_vec_unchecked(buffer)
    } else {
        // # Safety
        //  no null bytes before `buffer.last()` (thanks to `.read_until()` contract)
        CString::from_vec_unchecked(buffer)
    }
}

It is besides the question of safety and performance, but shouldn't from_reader_unchecked rather raised UnexpectedEof if it does not encounter a null-byte at all?

This is addressed by https://github.com/rust-lang/rust/pull/64069 and should ship in 1.43

The new trait impl addressing this has shipped in 1.43, so this issue can be closed.

Was this page helpful?
0 / 5 - 0 ratings