Ecto: Dynamic Repo configuration

Created on 24 Feb 2017 · 43Comments · Source: elixir-ecto/ecto

i have a use case where i'm unable to configure my database ahead of time via creating a Repo module. i know it's possible to compile at runtime but i'm uneasy about the possible security and performance implications of doing so. what i'd like is an alternative to Repo that accepts configuration at runtime

repo = [
  adapter: Ecto.Adapters.Postgres,
  database: "ecto",
  ...
]
DynamicRepo.all(repo, query, opts)

DynamicRepo would be identical to modules generated via using Ecto.Repo apart from the arity change of functions to accept repo options

this is doable now as a library but i'm unsure how much of a Repo's api is considered public and how much is considered private. if private, it may be beneficial to make the functionality part of ecto itself so the api's can be updated in tandem

i'm also open to alternatives if there is a better path forward

Advanced Discussion

Source

talentdeficit

Most helpful comment

We were able to remove the :name requirement. :)

I believe we have tackled everything required to make this work. The last step is to define an official API for pid-based repos but we would like to see more testing and get more feedback for doing so.

For those who want to give it a try, you can look at the callbacks defined by lib/ecto/repo.ex and define your own dynamic repo. The idea is that you get the same implementation as the one generated by lib/ecto/repo.ex, except you replace __MODULE__ by an explicit pid. For example, you can define in your own a DynamicRepo with insert, update, delete like below:

defmodule MyApp.DynamicRepo do
  def insert(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.insert(pid, struct, opts)
  end

  def update(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.update(pid, struct, opts)
  end

  def delete(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.delete(pid, struct, opts)
  end
end

Now to use it, you should start a non-dynamic repo, but forcing a nil name (or picking up another name):

{:ok, pid} = MyApp.Repo.start_link(name: nil)
MyApp.DynamicRepo.insert(pid, %Post{})

And that should be it.

If you want to connect to different databases and this kind of stuff, just pass the options to start_link/1.

josevalim on 29 May 2018

❤13 🎉4 👍1

All 43 comments

@talentdeficit have you tried the init callback we have recently added in Ecto 2.1? I believe from that callback you should be able to retrieve the configuration dynamically. So you configure the bare minimum in the config/config.exs file and the remaining options are configured on start.

There are probably some configurations that we are still using statically, such as the pool configuration in #1953 and there are probably things that are based on the name, such as ets tables that would avoid multiple starts, but with your help we can identify those and improve the code today.

What do you think?

josevalim on 24 Feb 2017

I was thinking about this as well. What if we had the Ecto.Repo (or some other module) export functions exactly as those generated in user repos, but accepting an additional argument - a config struct. For the regular repos, we could generate this config struct based on the env and delegate the implementation, similar to what we're already doing today with multiple modules in the Ecto.Repo.* namespace. If somebody needs dynamic access to the repo, they could build this struct manually and call the extended functions directly.

This would make the ecto interface function-first with the macro generation of use Ecto.Repo a mere convenience for removing the boilerplate of passing the config struct everywhere manually.

michalmuskala on 24 Feb 2017

@michalmuskala it is not that straight-forward because you still need to start a supervision tree, the pool and the processes that connect to the database. You cannot simply pass a config to Ecto.Repo.insert and have it connect to the database, that would be inefficient and its processes wouldn't be properly supervised.

josevalim on 24 Feb 2017

@michalmuskala although your description would be correct if, instead of passing of a config, we pass a pid of a previously started repo.

josevalim on 24 Feb 2017

Yes, you'd also have a start_link or a child_spec where you'd pass the same config to start the "dynamic" repo - you could do this multiple times (with different configs) to start multiple "dynamic" repos.

A process pid works as well.

michalmuskala on 24 Feb 2017

@michalmuskala I don't think it is possible to skip all of the steps you want to skip. For example, what about Repo.query! that are specific only to SQL backends? We need to know when to generate it. And again, you cannot pass a struct to insert, the pid is the only option that works. Once you consider a separate repo definition because of adapter concerns and factor in the need to supervise, you end-up with something similar to the current proposal.

josevalim on 24 Feb 2017

@josevalim i believe with the init callback i still need to know which Repo's i'll need up front. i can change the config for a repo, but i can't create a new repo unless i'm missing something

talentdeficit on 24 Feb 2017

@talentdeficit you can have a single Repo that works for everything that talks to Postgres. And start multiple instances of it connected to different backends.

josevalim on 24 Feb 2017

👍1

@josevalim i'm confused. how does the repo know what database to send the request to?

talentdeficit on 24 Feb 2017

Using init/2. :)

josevalim on 24 Feb 2017

Or you can even pass the parameters on start_link: Repo.start_link(hostname: ..., database: ...)`.

josevalim on 24 Feb 2017

i'm clearly missing something here

foo = Repo.start_link(foo_opts)
bar = Repo.start_link(bar_opts)

query = ...

Repo.all(query, opts)

how does Repo know how to dispatch?

talentdeficit on 24 Feb 2017

@talentdeficit that's the other part that needs to be solved 😅 In the past we have supported an option to store it in the process dictionary (since the pool information was stored there too).

josevalim on 24 Feb 2017

oh i'm sorry, i thought you meant this was possible now.

how about adapting @michalmuskala's suggestion and adding concrete functions to Repo that can be called with reference (pid or registered name) to the repo instance?

talentdeficit on 24 Feb 2017

@talentdeficit we could do so... BUT then all the functions that are adapter specific, such as Repo.query!, to_sql, etc won't work.

josevalim on 24 Feb 2017

they could just throw a undef error if the adapter doesn't support the function? you don't get compile time warnings but you get the same functionality/behavior as you would with a static repo. i think that is an ok tradeoff?

talentdeficit on 24 Feb 2017

@talentdeficit that's what would ultimately happen, yes. It is worth that the functions generated by Ecto.Repo work almost all on the way you are proposing, it is a matter of exposing them at the Ecto.Repo level. A proof of concept could start exactly by invoking those private functions for now until we have all details in place.

josevalim on 24 Feb 2017

The only non-dynamic part is the ecto query cache API which should also be fixable with the current registry. Then it is a matter of exposing the proper API (and agree where it should be exposed).

josevalim on 11 Mar 2017

@josevalim I'm interested in working on this part. If is there anything already done, I can take a look.

Tica2 on 21 May 2017

Currently, I'm trying to implement a DB password rotation to integrate Ecto with Hashicorp Vault database secret backend.

I have an implementation idea for current Ecto version, but It will be a very dirty hack:

By my findings, the configuration is stored only in DBConnection.Ownership.PoolSupervisor child's and supervisor itself has a public API to spawn second pool with a new configuration.
Ecto/Postgrex has a PID to the pool, which can be changed with low-level Erlang call :sys.replace_state.
In case Repo is getting restarted, an init/2 callback can be also aware of token rotation.

Other ideas:

Create a wrapper that will use multiple repos and rotate them (by @michalmuskala).
Restart repo and rely on init/2 callback (by @benwilson512).
Not related to Ecto. Write a supervisor that will rotate Kubernetes pods when a token is expiring, and application can re-init with a new state. Kubernetes services will allow to gracefully move traffic from one pod to another.

First will work, but It will require changes in existing codebase (which becomes large).

The second option won't work for our use-case. Passwords are going to have a short lifespan so stopping the application or its parts may be an issue.

For the same reason, the third one is also not a good idea (restarting application every few minutes).

But we can rotate it before previous one expires - so it's okay to have configuration in inconsistent state for a short period of time (eg. spawn two pools and shut down old one after new one is ready to process requests).

Any ideas on this?

AndrewDryga on 27 Jun 2017

It will require changes in existing codebase (which becomes large).

I feel like the changes can be mitigated if you just refactor the existing Repo module to be the wrapper itself. Then calls like query |> Repo.all don't need to change.

benwilson512 on 28 Jun 2017

Is this issue being worked on? I saw that dynamic pool configuration was delivered in #1977.

I am working on an application which allows users to connect and query databases at runtime and I cannot use Ecto if it depends on config files. I would love to start using this if possible.

pulzzejason on 6 Oct 2017

👍1

@pulzzejason this is up for grabs, so please go ahead. What is left is to introduce a Ecto.DynamicRepo that accepts configuration (or a DynamicRepo instance) as first argument and then works like Ecto.Repo.

Folks that have requested this feature so far, do you want to connect to the database on every query or do you want to have still be able to start the dynamic repository and have queries run against it?

josevalim on 6 Oct 2017

👍1

Folks that have requested this feature so far, do you want to connect to the database on every query or do you want to have still be able to start the dynamic repository and have queries run against it?

I don't think we should support connect to database on every query/transaction because it will be very slow and so few people would use it. Every transaction (a single query is an implicit transaction) would require a surprising number of round trips before the actual transaction begins. Also I'm not sure how easy it would be to get clean socket shutdown working in the client drivers. If we don't have a graceful close and rely on socket level closes we may get too much churn on the database and hit client limits because the backend won't react immediately. Some people saw the equivalent using pool_overflow with the poolboy pool.

fishcakez on 7 Oct 2017

I agree with @fishcakez here.

I'd mostly be using this dynamic repo functionality for a multi tenant scenario, so I'd like to keep the connections open.

(and the multi tenant storage must be different databases for a few reasons, some are contractual and some are just for peace of mind)

In Go (which is currently being used), I keep a map that references the database connection pointers with keys being the tenants' IDs. If it's not found, it creates a new one, stores a pointer to it in the map and allows me to use it.

Because we don't have many tenants, I'm not cleaning it up any yet but it probably wouldn't be a bad idea to remove connections if they haven't been used in a while.

mylanconnolly on 1 Nov 2017

👍3

Has anyone done more work on this? I need the functionality for a multi-tenant app I'm currently building and I'm happy to take it on, but would prefer not to duplicate others' work. 🙂

InAbsentia on 20 Feb 2018

I don't believe we have had any progress since then. :)

josevalim on 20 Feb 2018

Ok, thanks. I'll send a PR once I get started.

InAbsentia on 21 Feb 2018

👍2

I have a question for folks using dynamic repositories: how do you handle things like creating new database instances and migrating them?

josevalim on 26 Feb 2018

👍1

That’s a good question. 🤔 I hadn’t even considered it because, in my case, another system will be creating and managing the customer databases; my system just needs to read from them.

InAbsentia on 26 Feb 2018

How I implement this (maybe not the best way) is this:

I keep a "reference" database that has the same migrations and such applied as the rest of the tenant databases. This may also include some dummy records to make sure the new database is ready to go on creation.

Then, when I need to create a new one I connect with a user that has enough permissions to create databases and create a new database using the WITH TEMPLATE option.

Going forward when new migrations must be applied, I connect to my admin database (that houses a listing of tenants) to get a list of all tenants and generate the tenant database names (like tenant_1234, tenant_1235, etc.) and perform the tenant DB migrations on all of them.

Each tenant gets its name from a template, like tenant_{{.ID}}.

Keep in mind, this isn't in Elixir, it's in Go.

So, to summarize, my databases are like:

admin is the administrative database that contains a listing of tenants, users, and other global items we use internally for administrative purposes
tenant is the reference database and used as a template on creating new tenants
tenant_1234 is one tenant
tenant_1235 is another tenant

I found this is a fairly decent sweet spot where we can create a hard physical boundary between tenants' data (using tablespaces, we can put them on different physical media, utilize different encryption keys, etc.) without spinning up new servers for each one.

mylanconnolly on 26 Feb 2018

👍2

In our case, we're trying to write a generic Kafka Topic --> PG dumper, which simply creates a table with the same name as the kafka topic (optionally appending/prepending a namespace) if it does not exist and does bulk inserts into the table.

The idea is that another team is managing the database with their application (they currently all use Ecto, but it doesn't necessarily have to be), and our data processing systems can simply dump data into a few tables. The table definition could either be managed by the other teams, or dynamically generated as we're using Avro Schemas and enforcing backwards compatibility for schema evolution.

Creating multiple instances of a Repo is a great solution for this, as we can require the DB URL be passed in to start_link. The only missing piece for us is how to make bulk insert calls into a given repo.

CJPoll on 1 Mar 2018

Thanks @mylanconnolly!

@CJPoll in your case, the databases are known upfront, and not dynamically, right?

josevalim on 6 Mar 2018

@InAbsentia btw, if you are building multitenant apps with postgres schemas (instead of databases), you don't need this.

josevalim on 6 Mar 2018

@josevalim Thanks, but in my case, each customer has a separate database. The customers are enterprise corporations, so they're not keen on having their data commingling with others' data 😉

InAbsentia on 7 Mar 2018

This is pretty much done. In Ecto master we have decoupled the process name from the repository module. You can spawn multiple repositories with different names. The only thing missing is the API that allows the repository name to be given as argument - but that's the cherry of the top - the internals are clean.

josevalim on 29 May 2018

❤2 👍1

I want to emphasize though that in master all repositories must still be named (with an atom). The different is that the process name does not have to be the same as the module name.

josevalim on 29 May 2018

👍1

@josevalim that those changes are coming is really good news, looking forward to it!

Is the cherry on top (passing the repository name as argument as you state) also around the corner and is there a reason that the repositories must still named with an atom or is this something that will go away (or changed) in the future?

rmoorman on 29 May 2018

@rmoorman it can go away in the future, and I don't think it is a lot of work to make it go away, but given the usage patterns said in this thread and that folks do not want to start repos dynamically, it seems to be an ok limitation to have for now.

josevalim on 29 May 2018

@josevalim the thing I am wondering about regarding dynamic repo's and naming them with atoms is, that you would have to dynamically generate atoms based on strings (inside a database) to support the setup @mylanconnolly described (which is a use case I would like to explore as well, connecting to a changing set of databases on demand) but please correct me if I am wrong.

rmoorman on 29 May 2018

We were able to remove the :name requirement. :)

defmodule MyApp.DynamicRepo do
  def insert(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.insert(pid, struct, opts)
  end

  def update(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.update(pid, struct, opts)
  end

  def delete(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.delete(pid, struct, opts)
  end
end

Now to use it, you should start a non-dynamic repo, but forcing a nil name (or picking up another name):

{:ok, pid} = MyApp.Repo.start_link(name: nil)
MyApp.DynamicRepo.insert(pid, %Post{})

And that should be it.

If you want to connect to different databases and this kind of stuff, just pass the options to start_link/1.

josevalim on 29 May 2018

❤13 🎉4 👍1

Is this scheduled to be included in a future release?

Edit: Thank you, @michalmuskala.

iautom8things on 9 Aug 2018

Yes this is scheduled to be part of the 3.0 release. You can track progress with the v3.0 milestone.

michalmuskala on 9 Aug 2018

👍6 ❤5

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Clearly document supported field data types

AndresOsinski · 5Comments

Casting empty strings is broken

stavro · 4Comments

Migrations fail: function add/1 undefined

jbence · 3Comments

"Cannot load `#Decimal<..>` as type :integer" on coalesce in subquery

jordi-chacon · 4Comments

Dynamic DynamicExpr doesn't work as expected

kelostrada · 3Comments