Ecto: Dynamic Repo configuration

Created on 24 Feb 2017  路  43Comments  路  Source: elixir-ecto/ecto

i have a use case where i'm unable to configure my database ahead of time via creating a Repo module. i know it's possible to compile at runtime but i'm uneasy about the possible security and performance implications of doing so. what i'd like is an alternative to Repo that accepts configuration at runtime

repo = [
  adapter: Ecto.Adapters.Postgres,
  database: "ecto",
  ...
]
DynamicRepo.all(repo, query, opts)

DynamicRepo would be identical to modules generated via using Ecto.Repo apart from the arity change of functions to accept repo options

this is doable now as a library but i'm unsure how much of a Repo's api is considered public and how much is considered private. if private, it may be beneficial to make the functionality part of ecto itself so the api's can be updated in tandem

i'm also open to alternatives if there is a better path forward

Advanced Discussion

Most helpful comment

We were able to remove the :name requirement. :)

I believe we have tackled everything required to make this work. The last step is to define an official API for pid-based repos but we would like to see more testing and get more feedback for doing so.

For those who want to give it a try, you can look at the callbacks defined by lib/ecto/repo.ex and define your own dynamic repo. The idea is that you get the same implementation as the one generated by lib/ecto/repo.ex, except you replace __MODULE__ by an explicit pid. For example, you can define in your own a DynamicRepo with insert, update, delete like below:

defmodule MyApp.DynamicRepo do
  def insert(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.insert(pid, struct, opts)
  end

  def update(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.update(pid, struct, opts)
  end

  def delete(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.delete(pid, struct, opts)
  end
end

Now to use it, you should start a non-dynamic repo, but forcing a nil name (or picking up another name):

{:ok, pid} = MyApp.Repo.start_link(name: nil)
MyApp.DynamicRepo.insert(pid, %Post{})

And that should be it.

If you want to connect to different databases and this kind of stuff, just pass the options to start_link/1.

All 43 comments

@talentdeficit have you tried the init callback we have recently added in Ecto 2.1? I believe from that callback you should be able to retrieve the configuration dynamically. So you configure the bare minimum in the config/config.exs file and the remaining options are configured on start.

There are probably some configurations that we are still using statically, such as the pool configuration in #1953 and there are probably things that are based on the name, such as ets tables that would avoid multiple starts, but with your help we can identify those and improve the code today.

What do you think?

I was thinking about this as well. What if we had the Ecto.Repo (or some other module) export functions exactly as those generated in user repos, but accepting an additional argument - a config struct. For the regular repos, we could generate this config struct based on the env and delegate the implementation, similar to what we're already doing today with multiple modules in the Ecto.Repo.* namespace. If somebody needs dynamic access to the repo, they could build this struct manually and call the extended functions directly.

This would make the ecto interface function-first with the macro generation of use Ecto.Repo a mere convenience for removing the boilerplate of passing the config struct everywhere manually.

@michalmuskala it is not that straight-forward because you still need to start a supervision tree, the pool and the processes that connect to the database. You cannot simply pass a config to Ecto.Repo.insert and have it connect to the database, that would be inefficient and its processes wouldn't be properly supervised.

@michalmuskala although your description would be correct if, instead of passing of a config, we pass a pid of a previously started repo.

Yes, you'd also have a start_link or a child_spec where you'd pass the same config to start the "dynamic" repo - you could do this multiple times (with different configs) to start multiple "dynamic" repos.

A process pid works as well.

@michalmuskala I don't think it is possible to skip all of the steps you want to skip. For example, what about Repo.query! that are specific only to SQL backends? We need to know when to generate it. And again, you cannot pass a struct to insert, the pid is the only option that works. Once you consider a separate repo definition because of adapter concerns and factor in the need to supervise, you end-up with something similar to the current proposal.

@josevalim i believe with the init callback i still need to know which Repo's i'll need up front. i can change the config for a repo, but i can't create a new repo unless i'm missing something

@talentdeficit you can have a single Repo that works for everything that talks to Postgres. And start multiple instances of it connected to different backends.

@josevalim i'm confused. how does the repo know what database to send the request to?

Using init/2. :)

Or you can even pass the parameters on start_link: Repo.start_link(hostname: ..., database: ...)`.

i'm clearly missing something here

foo = Repo.start_link(foo_opts)
bar = Repo.start_link(bar_opts)

query = ...

Repo.all(query, opts)

how does Repo know how to dispatch?

@talentdeficit that's the other part that needs to be solved 馃槄 In the past we have supported an option to store it in the process dictionary (since the pool information was stored there too).

oh i'm sorry, i thought you meant this was possible now.

how about adapting @michalmuskala's suggestion and adding concrete functions to Repo that can be called with reference (pid or registered name) to the repo instance?

@talentdeficit we could do so... BUT then all the functions that are adapter specific, such as Repo.query!, to_sql, etc won't work.

they could just throw a undef error if the adapter doesn't support the function? you don't get compile time warnings but you get the same functionality/behavior as you would with a static repo. i think that is an ok tradeoff?

@talentdeficit that's what would ultimately happen, yes. It is worth that the functions generated by Ecto.Repo work almost all on the way you are proposing, it is a matter of exposing them at the Ecto.Repo level. A proof of concept could start exactly by invoking those private functions for now until we have all details in place.

The only non-dynamic part is the ecto query cache API which should also be fixable with the current registry. Then it is a matter of exposing the proper API (and agree where it should be exposed).

@josevalim I'm interested in working on this part. If is there anything already done, I can take a look.

Currently, I'm trying to implement a DB password rotation to integrate Ecto with Hashicorp Vault database secret backend.

I have an implementation idea for current Ecto version, but It will be a very dirty hack:

  1. By my findings, the configuration is stored only in DBConnection.Ownership.PoolSupervisor child's and supervisor itself has a public API to spawn second pool with a new configuration.
  2. Ecto/Postgrex has a PID to the pool, which can be changed with low-level Erlang call :sys.replace_state.
  3. In case Repo is getting restarted, an init/2 callback can be also aware of token rotation.

Other ideas:

  1. Create a wrapper that will use multiple repos and rotate them (by @michalmuskala).
  2. Restart repo and rely on init/2 callback (by @benwilson512).
  3. Not related to Ecto. Write a supervisor that will rotate Kubernetes pods when a token is expiring, and application can re-init with a new state. Kubernetes services will allow to gracefully move traffic from one pod to another.

First will work, but It will require changes in existing codebase (which becomes large).

The second option won't work for our use-case. Passwords are going to have a short lifespan so stopping the application or its parts may be an issue.

For the same reason, the third one is also not a good idea (restarting application every few minutes).

But we can rotate it before previous one expires - so it's okay to have configuration in inconsistent state for a short period of time (eg. spawn two pools and shut down old one after new one is ready to process requests).

Any ideas on this?

It will require changes in existing codebase (which becomes large).

I feel like the changes can be mitigated if you just refactor the existing Repo module to be the wrapper itself. Then calls like query |> Repo.all don't need to change.

Is this issue being worked on? I saw that dynamic pool configuration was delivered in #1977.

I am working on an application which allows users to connect and query databases at runtime and I cannot use Ecto if it depends on config files. I would love to start using this if possible.

@pulzzejason this is up for grabs, so please go ahead. What is left is to introduce a Ecto.DynamicRepo that accepts configuration (or a DynamicRepo instance) as first argument and then works like Ecto.Repo.

Folks that have requested this feature so far, do you want to connect to the database on every query or do you want to have still be able to start the dynamic repository and have queries run against it?

Folks that have requested this feature so far, do you want to connect to the database on every query or do you want to have still be able to start the dynamic repository and have queries run against it?

I don't think we should support connect to database on every query/transaction because it will be very slow and so few people would use it. Every transaction (a single query is an implicit transaction) would require a surprising number of round trips before the actual transaction begins. Also I'm not sure how easy it would be to get clean socket shutdown working in the client drivers. If we don't have a graceful close and rely on socket level closes we may get too much churn on the database and hit client limits because the backend won't react immediately. Some people saw the equivalent using pool_overflow with the poolboy pool.

I agree with @fishcakez here.

I'd mostly be using this dynamic repo functionality for a multi tenant scenario, so I'd like to keep the connections open.

(and the multi tenant storage must be different databases for a few reasons, some are contractual and some are just for peace of mind)

In Go (which is currently being used), I keep a map that references the database connection pointers with keys being the tenants' IDs. If it's not found, it creates a new one, stores a pointer to it in the map and allows me to use it.

Because we don't have many tenants, I'm not cleaning it up any yet but it probably wouldn't be a bad idea to remove connections if they haven't been used in a while.

Has anyone done more work on this? I need the functionality for a multi-tenant app I'm currently building and I'm happy to take it on, but would prefer not to duplicate others' work. 馃檪

I don't believe we have had any progress since then. :)

Ok, thanks. I'll send a PR once I get started.

I have a question for folks using dynamic repositories: how do you handle things like creating new database instances and migrating them?

That鈥檚 a good question. 馃 I hadn鈥檛 even considered it because, in my case, another system will be creating and managing the customer databases; my system just needs to read from them.

How I implement this (maybe not the best way) is this:

I keep a "reference" database that has the same migrations and such applied as the rest of the tenant databases. This may also include some dummy records to make sure the new database is ready to go on creation.

Then, when I need to create a new one I connect with a user that has enough permissions to create databases and create a new database using the WITH TEMPLATE option.

Going forward when new migrations must be applied, I connect to my admin database (that houses a listing of tenants) to get a list of all tenants and generate the tenant database names (like tenant_1234, tenant_1235, etc.) and perform the tenant DB migrations on all of them.

Each tenant gets its name from a template, like tenant_{{.ID}}.

Keep in mind, this isn't in Elixir, it's in Go.

So, to summarize, my databases are like:

  • admin is the administrative database that contains a listing of tenants, users, and other global items we use internally for administrative purposes
  • tenant is the reference database and used as a template on creating new tenants
  • tenant_1234 is one tenant
  • tenant_1235 is another tenant

I found this is a fairly decent sweet spot where we can create a hard physical boundary between tenants' data (using tablespaces, we can put them on different physical media, utilize different encryption keys, etc.) without spinning up new servers for each one.

In our case, we're trying to write a generic Kafka Topic --> PG dumper, which simply creates a table with the same name as the kafka topic (optionally appending/prepending a namespace) if it does not exist and does bulk inserts into the table.

The idea is that another team is managing the database with their application (they currently all use Ecto, but it doesn't necessarily have to be), and our data processing systems can simply dump data into a few tables. The table definition could either be managed by the other teams, or dynamically generated as we're using Avro Schemas and enforcing backwards compatibility for schema evolution.

Creating multiple instances of a Repo is a great solution for this, as we can require the DB URL be passed in to start_link. The only missing piece for us is how to make bulk insert calls into a given repo.

Thanks @mylanconnolly!

@CJPoll in your case, the databases are known upfront, and not dynamically, right?

@InAbsentia btw, if you are building multitenant apps with postgres schemas (instead of databases), you don't need this.

@josevalim Thanks, but in my case, each customer has a separate database. The customers are enterprise corporations, so they're not keen on having their data commingling with others' data 馃槈

This is pretty much done. In Ecto master we have decoupled the process name from the repository module. You can spawn multiple repositories with different names. The only thing missing is the API that allows the repository name to be given as argument - but that's the cherry of the top - the internals are clean.

I want to emphasize though that in master all repositories must still be named (with an atom). The different is that the process name does not have to be the same as the module name.

@josevalim that those changes are coming is really good news, looking forward to it!

Is the cherry on top (passing the repository name as argument as you state) also around the corner and is there a reason that the repositories must still named with an atom or is this something that will go away (or changed) in the future?

@rmoorman it can go away in the future, and I don't think it is a lot of work to make it go away, but given the usage patterns said in this thread and that folks do not want to start repos dynamically, it seems to be an ok limitation to have for now.

@josevalim the thing I am wondering about regarding dynamic repo's and naming them with atoms is, that you would have to dynamically generate atoms based on strings (inside a database) to support the setup @mylanconnolly described (which is a use case I would like to explore as well, connecting to a changing set of databases on demand) but please correct me if I am wrong.

We were able to remove the :name requirement. :)

I believe we have tackled everything required to make this work. The last step is to define an official API for pid-based repos but we would like to see more testing and get more feedback for doing so.

For those who want to give it a try, you can look at the callbacks defined by lib/ecto/repo.ex and define your own dynamic repo. The idea is that you get the same implementation as the one generated by lib/ecto/repo.ex, except you replace __MODULE__ by an explicit pid. For example, you can define in your own a DynamicRepo with insert, update, delete like below:

defmodule MyApp.DynamicRepo do
  def insert(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.insert(pid, struct, opts)
  end

  def update(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.update(pid, struct, opts)
  end

  def delete(pid, struct, opts \\ []) do
    Ecto.Repo.Schema.delete(pid, struct, opts)
  end
end

Now to use it, you should start a non-dynamic repo, but forcing a nil name (or picking up another name):

{:ok, pid} = MyApp.Repo.start_link(name: nil)
MyApp.DynamicRepo.insert(pid, %Post{})

And that should be it.

If you want to connect to different databases and this kind of stuff, just pass the options to start_link/1.

Is this scheduled to be included in a future release?

Edit: Thank you, @michalmuskala.

Yes this is scheduled to be part of the 3.0 release. You can track progress with the v3.0 milestone.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

AndresOsinski picture AndresOsinski  路  5Comments

stavro picture stavro  路  4Comments

jbence picture jbence  路  3Comments

jordi-chacon picture jordi-chacon  路  4Comments

kelostrada picture kelostrada  路  3Comments