My table has a primary key of type char(11)
. When querying this table, Diesel sends over a parameterized query and a value of type text
instead of bpchar
. This forces a slow sequential scan instead of doing an index-only scan. When running a similar query from psql, PostgreSQL uses the index.
Find an indexed char(n)
quickly
results in tens of milliseconds
results in 1+ second
No
I think all you need to observe this is an indexed char(n)
column, a .find(
or .filter(id.eq(
in Diesel, and auto_explain
on the server side to confirm sequential scans. But I have a complete test case set up to demonstrate this:
For the sake of completeness, I did this on a new Debian 9.5:
apt-get update
apt-get install --no-install-recommends libpq-dev curl git gcc libc-dev
adduser user
su user
curl https://sh.rustup.rs -sSf | sh
cargo install diesel_cli --no-default-features --features postgres
then
su postgres
createuser -S -D -R -P diesel_bpchar
createdb -O diesel_bpchar diesel_bpchar
git clone https://github.com/ludios/diesel_bpchar
cd diesel_bpchar
cargo build
echo "DATABASE_URL=postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar" > .env
diesel setup
diesel migration run
for i in {00000000001..0005000000}; do echo $i >> /tmp/ids; done
psql postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar -c "COPY videos FROM STDIN" < /tmp/ids
What should be a quick index lookup instead takes 1 second:
# time ./target/debug/find_video 00005000000
Found 1 videos
./target/debug/find_video 00005000000 0.01s user 0.00s system 1% cpu 0.998 total
With these postgresql server settings, you can see what is happening on the server side:
shared_preload_libraries = 'auto_explain' # (change requires restart)
auto_explain.log_min_duration = 0
auto_explain.log_analyze = true
auto_explain.log_verbose = true
log_statement = 'all' # none, ddl, mod, all
(note: auto_explain
is in the postgresql-contrib
package on Debian.)
When Diesel sends over its parameterized query, PostgreSQL logs a sequential scan:
2018-11-10 17:35:00.414 UTC [29344] diesel_bpchar@diesel_bpchar LOG: execute __diesel_stmt_0: SELECT "videos"."id" FROM "videos" WHERE "videos"."id" = $1
2018-11-10 17:35:00.414 UTC [29344] diesel_bpchar@diesel_bpchar DETAIL: parameters: $1 = '00005000000'
2018-11-10 17:35:01.390 UTC [29344] diesel_bpchar@diesel_bpchar LOG: duration: 976.167 ms plan:
Query Text: SELECT "videos"."id" FROM "videos" WHERE "videos"."id" = $1
Seq Scan on public.videos (cost=0.00..102028.00 rows=25000 width=12) (actual time=976.154..976.154 rows=1 loops=1)
Output: id
Filter: ((videos.id)::text = '00005000000'::text)
Rows Removed by Filter: 4999999
Run a similar query from psql
and PostgreSQL logs an Index Only Scan
:
# time psql postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar -c "SELECT videos.id FROM videos WHERE videos.id = '00005000000';"
id
-------------
00005000000
(1 row)
psql postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar 0.03s user 0.01s system 91% cpu 0.041 total
2018-11-10 17:39:55.150 UTC [31473] diesel_bpchar@diesel_bpchar LOG: statement: SELECT videos.id FROM videos WHERE videos.id = '00005000000';
2018-11-10 17:39:55.150 UTC [31473] diesel_bpchar@diesel_bpchar LOG: duration: 0.033 ms plan:
Query Text: SELECT videos.id FROM videos WHERE videos.id = '00005000000';
Index Only Scan using videos_pkey on public.videos (cost=0.43..8.45 rows=1 width=12) (actual time=0.028..0.028 rows=1 loops=1)
Output: id
Index Cond: (videos.id = '00005000000'::bpchar)
Heap Fetches: 1
Possible workarounds:
Text
if possibleSqlType
that maps to Char(n)
. See this testcase for an example (Just replace that enum by string, also you probably want to use #[postgres(oid = "18", array_oid = "1002")]
instead of #[postgres(type_name = "char")]
).Interesting, thanks for that SqlType
workaround. I was using .filter(sql(&format!("id = '{}'", id)))
, which required me to deal with SQL injection.
FWIW, I ran in to this exact problem, and an additional warning was added to the PostgreSQL wiki's "Don't Do This" list: https://wiki.postgresql.org/wiki/Don%27t_Do_This#Why_not.3F_12
I switched to TEXT
with a constraint. I suppose it's still the right thing for diesel to send bpchar
in this case.
There's a few deeper issues here. The history behind all of this is a bit scattered, so let me give some recap/context here.
A very long time ago, Diesel had Varchar
and Text
as separate types. This was painful for a variety of reasons. While they are treated as separate types in PG (though varchar actually isn't anymore, if you explain any query that involves varchar
, you'll see it's always unconditionally cast to text
), PG has implicit coercions that make them mostly interchangeable.
However, we can't really model those coercions easily in Diesel, due to a combination of limitations of Rust, and our architecture. There's no easy way for us to say impl<T: AsExpression<Text>> AsExpression<VarChar> for T
. So this would lead to really funky things like being unable to write text_col.eq(varchar_col)
, having functions like lower
only work with one type but not the other, and a ton of duplication in our own codebase.
We fixed this in #288 by just making varchar a type alias. This was a workaround, not a fix for the root problem. We've since encountered similar issues that cannot use this workaround. Numeric types are a major case where SQL coerces, but we cannot (it's much less common for this to be painful in practice though). Timestamp vs timestamptz is another case where the types coerce between each other, but do not have identical semantics, and our "just use a type alias" solution would not work.
However, strings are special. This really does (mostly) work for pretty much all string types. These types all have identical representation (contrary to what some folks believe, using char
, varchar
, bpchar
, etc have no benefit over using text
. They're all just bytea
under the hood). Additionally, all these types have the same set of Rust types that correspond to them. String
and str
.
However, this has come back to bite us in a few ways. The first one we encountered was that these coercions do not apply to arrays. You cannot use text[]
where a varchar[]
is expected. This is why we warn in diesel print-schema
and friends if we encounter a column of that type.
The second issue we encountered is that text
always "wins" in these coercions. When PG encounters some_string_col = ''::text
or ''::text = some_string_col
, it will cast some_string_col
to text, rather than the text to whatever type that column is. We first ran into this when we attempted to use this same type alias workaround to citext
. When we sent a bind parameter as text
, it would do a case sensitive comparison rather than an insensitive one. The case here is similar.
With all that said, I still don't really know how we can fix the "root" problem here. The best I've ever come up with is adding explicit casts for this, which at least gives us an out, but would still make splitting up the string types extremely painful.
That said, I do still believe that strings are "special", and I think we might be able to improve the situation here for specifically strings. You don't actually have to know the OID of the type you're transmitting in every case, you can also send 0, which means "interpret this as an untyped string literal". That might actually be exactly what we want for bind parameters sent for a string type. (Note: an implementation doing this will not be as simple as just changing the OID of Text
to 0
. There are still cases where we need the actual OID, particularly when sending it as part of an array or composite type)
Most helpful comment
There's a few deeper issues here. The history behind all of this is a bit scattered, so let me give some recap/context here.
A very long time ago, Diesel had
Varchar
andText
as separate types. This was painful for a variety of reasons. While they are treated as separate types in PG (though varchar actually isn't anymore, if you explain any query that involvesvarchar
, you'll see it's always unconditionally cast totext
), PG has implicit coercions that make them mostly interchangeable.However, we can't really model those coercions easily in Diesel, due to a combination of limitations of Rust, and our architecture. There's no easy way for us to say
impl<T: AsExpression<Text>> AsExpression<VarChar> for T
. So this would lead to really funky things like being unable to writetext_col.eq(varchar_col)
, having functions likelower
only work with one type but not the other, and a ton of duplication in our own codebase.We fixed this in #288 by just making varchar a type alias. This was a workaround, not a fix for the root problem. We've since encountered similar issues that cannot use this workaround. Numeric types are a major case where SQL coerces, but we cannot (it's much less common for this to be painful in practice though). Timestamp vs timestamptz is another case where the types coerce between each other, but do not have identical semantics, and our "just use a type alias" solution would not work.
However, strings are special. This really does (mostly) work for pretty much all string types. These types all have identical representation (contrary to what some folks believe, using
char
,varchar
,bpchar
, etc have no benefit over usingtext
. They're all justbytea
under the hood). Additionally, all these types have the same set of Rust types that correspond to them.String
andstr
.However, this has come back to bite us in a few ways. The first one we encountered was that these coercions do not apply to arrays. You cannot use
text[]
where avarchar[]
is expected. This is why we warn indiesel print-schema
and friends if we encounter a column of that type.The second issue we encountered is that
text
always "wins" in these coercions. When PG encounterssome_string_col = ''::text
or''::text = some_string_col
, it will castsome_string_col
to text, rather than the text to whatever type that column is. We first ran into this when we attempted to use this same type alias workaround tocitext
. When we sent a bind parameter astext
, it would do a case sensitive comparison rather than an insensitive one. The case here is similar.With all that said, I still don't really know how we can fix the "root" problem here. The best I've ever come up with is adding explicit casts for this, which at least gives us an out, but would still make splitting up the string types extremely painful.
That said, I do still believe that strings are "special", and I think we might be able to improve the situation here for specifically strings. You don't actually have to know the OID of the type you're transmitting in every case, you can also send 0, which means "interpret this as an untyped string literal". That might actually be exactly what we want for bind parameters sent for a string type. (Note: an implementation doing this will not be as simple as just changing the OID of
Text
to0
. There are still cases where we need the actual OID, particularly when sending it as part of an array or composite type)