Julia: mmap with arbitrary offsets no longer allowed

Created on 3 Aug 2018 · 13 Comments · Source: JuliaLang/julia

I write a file in 2 parts, and intend to Mmap the second part. The size of the first part is not a multiple of 8 bytes:

julia> o = open("/tmp/x", "w")
IOStream(<file /tmp/x>)
julia> write(o, UInt8[1,2])  # maybe some header
2
julia> write(o, Float64[1,2,3]) # actual data
24
julia> close(o)

Now trying to mmap at an offset of 2 gives me:

julia> o = open("/tmp/x", "r+")
IOStream(<file /tmp/x>)

julia> Mmap.mmap(o, Array{Float64,1}, (3,), 2)
ERROR: ArgumentError: unsafe_wrap: pointer 0x7f9d852a4002 is not properly aligned to 8 bytes
Stacktrace:
 [1] #unsafe_wrap#56 at ./pointer.jl:84 [inlined]
 [2] unsafe_wrap at ./pointer.jl:84 [inlined]
 [3] #mmap#1(::Bool, ::Bool, ::Function, ::IOStream, ::Type{Array{Float64,1}}, ::Tuple{Int64}, ::Int64) at /home/shashi/code/julia/usr/share/julia/stdlib/v0.7/Mmap/src/Mmap.jl:226
 [4] mmap(::IOStream, ::Type{Array{Float64,1}}, ::Tuple{Int64}, ::Int64) at /home/shashi/code/julia/usr/share/julia/stdlib/v0.7/Mmap/src/Mmap.jl:188
 [5] top-level scope at none:0

Would this be considered a bug? If not, is there a built-in way to deal with this?


shashi

All 13 comments

Well, unless some architecture supports bit-by-bit operations, you need to align memory.

fisiognomico on 3 Aug 2018

No, this is not a bug; an Array has to be aligned. This does NOT require "bit-by-bit" operations though, so other array types could do this. You can use https://github.com/JuliaArrays/UnalignedVectors.jl.

yuyichao on 3 Aug 2018

It might not be a bug, but it is a major reduction in functionality. It thus needs a proper deprecation with the official workaround listed. I think the workaround should be included in the Mmap module, since without it mmap functionality is severely reduced.

tknopp on 3 Aug 2018

You can use reinterpret for this now – that gives us enough information to know exactly how much to pessimize the load operation to account for the mis-aligned data. In most cases though, it's perhaps better to use read(io, T).

vtjnash on 3 Aug 2018

👍1
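A minimal sketch of the `read`-based route suggested above, assuming the file layout from the original report (a 2-byte header followed by three `Float64` values). The temporary path and the variable names are illustrative only.

```julia
# Recreate the file from the report: 2 header bytes, then 3 Float64 values.
path = tempname()
open(path, "w") do io
    write(io, UInt8[1, 2])
    write(io, Float64[1, 2, 3])
end

data = open(path, "r") do io
    seek(io, 2)                           # skip the 2-byte header
    read!(io, Vector{Float64}(undef, 3))  # fills and returns a freshly allocated, aligned array
end
```

This copies the payload into ordinary memory, so alignment of the file offset no longer matters, at the cost of not being a live view of the file.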

Could you please provide an MWE for using reinterpret? The suggestion to use read is to circumvent mmap altogether, but what's wrong with mmap?

tknopp on 3 Aug 2018

Okay, I think I'll just align stuff by hand. Thanks everyone! 👍

You can use reinterpret

You mean mmap as UInt8, then reinterpret as Float64? That wouldn't work, would it? It reaches the same condition.

shashi on 3 Aug 2018
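Aligning by hand can be sketched as follows, assuming the writer controls the file layout and is free to pad the header with zero bytes. The padding scheme and variable names are hypothetical, not something the thread prescribes.

```julia
using Mmap

path = tempname()
header = UInt8[1, 2]
# Zero bytes needed so the payload starts at a multiple of sizeof(Float64).
padding = mod(-length(header), sizeof(Float64))

open(path, "w") do io
    write(io, header)
    write(io, zeros(UInt8, padding))
    write(io, Float64[1, 2, 3])
end

offset = length(header) + padding  # start of the payload in the file

mapped = open(path, "r") do io
    # The offset is now a multiple of 8, so a plain Vector{Float64} mapping is aligned.
    copy(Mmap.mmap(io, Vector{Float64}, (3,), offset))
end
```

The reader has to know about the padding as well, for example by recomputing it from the header length as done here.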

That wouldn't work, would it? It reaches the same condition.

I believe that ReinterpretArrays allow this, which is one of the reasons they're a different type, so that the compiler can safely assume that normal arrays are aligned and emit less optimized but working code for reinterpreted arrays.

StefanKarpinski on 3 Aug 2018
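A minimal sketch of the `reinterpret` approach discussed above, assuming the file from the original report: the payload is mapped as `UInt8`, which has no alignment requirement, and then viewed as `Float64`. The temporary path and variable names are illustrative.

```julia
using Mmap

# Recreate the file from the report: 2 header bytes, then 3 Float64 values.
path = tempname()
open(path, "w") do io
    write(io, UInt8[1, 2])
    write(io, Float64[1, 2, 3])
end

payload = open(path, "r") do io
    # Map the 24 payload bytes starting at the odd offset 2.
    bytes = Mmap.mmap(io, Vector{UInt8}, (3 * sizeof(Float64),), 2)
    # View the bytes as Float64; this is a reinterpreted view, not an Array.
    floats = reinterpret(Float64, bytes)
    collect(floats)  # copy out so the result does not depend on the mapping
end
```

The intermediate `floats` supports normal indexing and slicing as an `AbstractVector{Float64}`, but it is not a `Vector{Float64}`, so code that dispatches on `Array` will not accept it without a copy.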

xref: https://discourse.julialang.org/t/upgrade-to-0-7-issue-with-mmap/12965

I suggest that mmap returns a reinterpreted Array. Or in other words, this is a bug in mmap and the issue should be reopened.

tknopp on 7 Aug 2018

👎1 👍1

So mmap should guarantee bad performance (e.g. considerably worse than doing read!) by serving the lowest common denominator? I don't think we're going to make that the interface. Most users are probably better off avoiding mmap altogether, since it can be much slower than read and has various significant limitations (and will typically, at best, only be slightly faster).

vtjnash on 7 Aug 2018

This is breaking user code and I just asked for a solution. Using read is a lot more complicated right now, since I am using this behind HDF5. Mmapped arrays also give me simple slicing for free. With read I need to do this by hand. I will try this, but please acknowledge that this broke HDF5 mmap support and that this needs to be fixed.

tknopp on 7 Aug 2018

With #28707 the performance concerns for a reinterpret-based fix of mmap seem to be addressed.

I still think that the current status of mmap failing unpredictably based on offset is not good.

tknopp on 17 Aug 2018

👎1

With #28707 the performance concerns for a reinterpret-based fix of mmap seem to be addressed.

No, it's reduced, not gone. I hope you are not suggesting that we should drop all alignment assumptions everywhere since there's no performance issue.

I still think that the current status of mmap failing unpredictably based on offset is not good.

How is it "unpredictable"? It's very predictable based on the offset and in fact, it has been undefined behavior in your code in all previous versions. It just happens that LLVM is nice enough to not fault on it (or at least not in a way that you noticed). Now it's more predictable than before.

Finally, it makes no sense to change this. Right now, you can implement both behaviors easily, but you are suggesting that the current behavior will not be implementable anymore. This will basically be rewarding people who write wrong code and hurting people who write the right one by reducing what could be done with the API.

yuyichao on 17 Aug 2018

How is it "unpredictable"?

It is unpredictable for the user since the offset is usually a runtime value. One cannot know its value during coding; it depends on the file that is mmapped (see HDF5 code).

This will basically be rewarding people who write wrong code

This is a pretty harsh statement. If you think that the original HDF5 code is wrong, then please provide the right one.

In the help string of Mmap.mmap there is no discussion that arbitrary offsets are disallowed. Could you please fix this and throw an appropriate ArgumentError if you are so convinced that the current behavior is best? For me as a user it is not clear what conditions offset has to fulfill.

tknopp on 17 Aug 2018
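The disagreement above concerns offsets that are only known at run time. One way user code could handle both cases is sketched below, assuming a hypothetical helper `mmap_at` (not part of `Mmap`) and the conservative rule that an offset counts as aligned when it is a multiple of `sizeof(T)`; the actual requirement enforced by Julia may be weaker.

```julia
using Mmap

# Hypothetical helper, not part of the Mmap stdlib: map `n` elements of type T
# starting at byte `offset`. Aligned offsets get a plain Vector{T}; anything
# else falls back to a byte mapping viewed through `reinterpret`.
function mmap_at(io::IO, ::Type{T}, n::Integer, offset::Integer) where {T}
    if offset % sizeof(T) == 0  # conservative alignment check (assumption)
        return Mmap.mmap(io, Vector{T}, (n,), offset)
    else
        bytes = Mmap.mmap(io, Vector{UInt8}, (n * sizeof(T),), offset)
        return reinterpret(T, bytes)
    end
end

# Write a file with `headerlen` header bytes followed by three Float64 values.
function write_sample(headerlen::Integer)
    path = tempname()
    open(path, "w") do io
        write(io, zeros(UInt8, headerlen))
        write(io, Float64[1, 2, 3])
    end
    return path
end

unaligned = open(io -> collect(mmap_at(io, Float64, 3, 2)), write_sample(2), "r")
aligned   = open(io -> collect(mmap_at(io, Float64, 3, 8)), write_sample(8), "r")
```

The return type depends on the offset, so callers that need type stability would have to pick one branch unconditionally or copy the result into a `Vector`.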
