Prisma1: Embedded Types

Created on 30 Jul 2018 · 25 Comments · Source: prisma/prisma1

Intro

Prisma currently does not have dedicated support for embedded types. You can read the backstory on this decision in https://github.com/prismagraphql/prisma/issues/1160#issuecomment-345373277

With upcoming support for MongoDB, embedded types are no longer just an abstract concept but impact how the data is stored. As such, now is a good time to revisit our support for embedded types.

What are Embedded Types?

An embedded type is a type that cannot stand on its own. It is always scoped to a parent node.

Prisma today

Note: open issue describing improvements we could introduce to the existing API: https://github.com/prismagraphql/prisma/issues/1889

In the following example Address is an embedded type that is owned by Company:

type Company {
  id: ID! @unique
  name: String!
  address: Address
}

type Address {
  street: String
  city: String
  country: String
}

Using one of Prisma's SQL connectors, this currently creates two separate types that can be manipulated independently:

mutation company {
  createCompany(data: {
    name: "MegaCorp"
  }) {
    id
  }
}

mutation address {
  createAddress(data: {
    street: "Downing Street"
    city: "London"
    country: "United Kingdom"
  }) {
    street
  }
}

mutation combined {
  createCompany(data: {
    name: "MegaCorp"
    address: {
      create: {
        street: "Downing Street"
        city: "London"
        country: "United Kingdom"
      }
    }
  }) {
    id
  }
}

As you can see, Prisma currently doesn't enforce that Address should only be available within the scope of a parent type.
Due to this, it is possible to end up with "dangling" Address nodes that are not connected to a Company. If Prisma supported proper embedded types, it would be possible to rely on Prisma to enforce this constraint.
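
For example, assuming the plural addresses query Prisma generates for every type, dangling nodes show up alongside connected ones:

{
  addresses {
    street
    city
    country
  }
}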

In this issue we will consider Prisma's existing types to not be embedded, independent of whether they have any unique fields or not. We will further suggest introducing a new notion of an embedded type.

Real embedded types

Embedded types differ from normal types in two key ways:

  • They can only be accessed and modified through their owning type
  • They are physically colocated with their parent type

Note: We should think hard about whether it is really a good idea to couple these two characteristics.

Some other characteristics of embedded types:

  • They cannot have back-relations
  • They cannot easily be migrated to normal types (the data would have to be copied to a new table)

To configure a type to be embedded:

type Company {
  id: ID! @unique
  name: String!
  address: Address @relation(name: "CompanyAddress", storage: Embedded)
}

type Address {
  street: String
  city: String
  country: String
}

Storage models in various databases

SQL

Normal types are stored in individual tables and foreign keys are used to create relations.

Embedded types are stored in a JSON column. Normal filters are supported.

Note that the SQL JSON type has some serious limitations. Neither Postgres nor MySQL supports schema validation, so while Prisma will enforce validation during inserts and updates, migrations of embedded types will not change the shape of existing data.

Mongo

Normal types are stored in individual collections. Just like for the SQL connectors, relations are maintained by storing the id of related nodes. MongoDB does not support foreign key constraints and joins, so Prisma will do the work to enforce referential consistency. Relational filters are not supported.

Embedded types map very naturally onto Mongo's nested document structure. Normal filters are supported.

Note that MongoDB has a limit of 16 MB per document, limiting the size of embedded types, especially for embedded relations.

Neo4J

Normal types are stored as nodes and relations are natively supported.

It is unclear how we would store embedded types, but stored procedures to parse JSON might be an option: https://neo4j.com/developer/kb/using-apoc-to-parse-json-results-from-trello-api/. This would most likely make filters unavailable.

DynamoDB

Normal types are stored in individual tables. Relations are handled as for MongoDB.

Embedded types are stored as stringified JSON. This makes filtering impossible.

Labels: rf0-needs-spec · area/engine · status/stale

All 25 comments

I consider embedded types to be a storage optimization, subject to the support of the underlying storage engine. As you have illustrated above, most storage engines do not play well with embedded types. Basically, only Postgres and Mongo have some support for directly accessing the embedded data for searching/filtering etc.

I believe a potential problem arises when storage-engine specific configurations start to make their way into the _generic_ GraphQL schema. It should at all times be possible to swap out connectors, and everything should still continue to work. I should be able to use a local Postgres database on my development machine, a Mongo database on my test environment, and DynamoDB in production, without a single issue.

This leads me to the question whether the GraphQL schema is the correct place to store this information. I think the most important question is: what other storage-engine or environment-specific directives do you expect in the near future? From that answer, we can decide the best way to specify this.

For example, as an alternative, there could be a single @storage directive that can be used everywhere for anything storage-specific (indices, embedded types, datatype hints, etc.). This might or might not be more future proof.
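
A minimal sketch of what such a directive could look like; the @storage directive and its arguments are invented here purely for illustration:

type Company {
  id: ID! @unique
  name: String! @storage(index: true)
  address: Address @storage(mode: EMBEDDED)
}

type Address {
  street: String
  city: String
  country: String
}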

Or maybe these kinds of 'storage hints' should not be part of the schema at all, but part of an additional configuration file, because like I mentioned above, swapping out connectors should be easy, so having storage-engine-specific hints in the schema might not be desirable at all.

So no crystal clear solutions from me yet, but I hope I have touched a few points to fuel the discussion.

Thanks for the thoughtful reply Kim - that's really helpful for furthering the discussion. You raise 3 main points that I will address individually:

All connectors should be equivalent

I should be able to use a local Postgres database on my development machine, a Mongo database on my test environment, and DynamoDB in production, without a single issue

This is a very fundamental decision we have to make. While I agree that there is huge benefit to having the configuration (SDL) and interface (GraphQL API) of all connectors be as similar as possible, I don't believe it is a core value proposition of Prisma that they are exactly the same.

Scenarios I think Prisma should support very well:

  • Pick the connector that offers best performance or query capabilities for your particular use case
  • Migrate between connectors with minimal effort
  • Learning to use one connector should be enough to know how to use all connectors
  • Use multiple connectors in a single service

Scenarios I don't think Prisma needs to support:

  • Run the exact same service on different connectors in different environments.

I think we should have classes of connectors that are interchangeable within each class. SQL connectors are an obvious category, but we could also support multiple document databases, multiple graph databases, etc.
Using MS SQL Server in production, MySQL locally, and SQLite in tests is a goal we should pursue.

I believe Prisma must expose as close to the full power of the underlying database as possible. To achieve this we must accept that there are differences between connectors. One such example is that all SQL connectors support relational queries. The MongoDB connector will be able to support relational filters only for embedded types due to lack of join support. We could decide to emulate relational filters for Mongo in the Prisma layer. This would work and be fast for small data sets, but would be unusable in a production environment, so I don't think we should support this.

Do you agree with this view of how Prisma should evolve?

Note: Currently the SQL connectors behave very differently when you are connecting to an existing database compared to using Prisma migrations to set up a new database. For example, it is possible to have relations that don't use a relation table. We will unify this behaviour over time.

Storage optimisation vs data model semantics

I think about embedded types this way:
By declaring a type embedded, the developer imposes limits on the API that is being exposed. This is similar to how one might declare a field on a class private. It provides no extra functionality, but can help you convey intent in a large system with many collaborators. When a field is private, the compiler can take advantage of this fact and perform more aggressive performance optimisations. In the same way, a Prisma connector might take advantage of the fact that a type is embedded and choose to colocate the data.

The desire to tie co-located storage and logical data encapsulation together comes from wanting to reduce the number of concepts required to understand Prisma. Ultimately, this might not be the best way to go, so let's explore it further.

There are 4 possible combinations:

1. co-located storage and logical data encapsulation

This is what is called embedded types in this proposal

2. co-located storage and no data encapsulation

This would be very cumbersome to implement and provide no performance benefits in most cases

3. normalised storage and logical data encapsulation

We can implement this, but it would provide no performance benefit. This would allow for easy migration between the two encapsulation modes

4. normalised storage and no data encapsulation

This is just our normal types.

So what we are talking about is whether we should implement just 1 or allow configuration to choose between 1 and 3. I'm not convinced 3 provides enough value that we need to allow for it.

Other storage engine specific configuration

This is what comes to mind:

  • Indexes
  • Caching
  • Computed values / projections (pre-compute vs on-the-fly)
  • Storage locality

It's worth noting that the datamodel.graphql file currently serves two purposes:

  • Describe the mapping required to construct a GraphQL API over a database
  • Describe the desired state of a database so the migration system can apply the correct migrations

We might benefit from thinking about these two purposes independently.

For SQL and quite possibly other DBs that don't support direct object embedding, I would suggest the embedded types be stored in unique columns, with the fully qualified path as the column name. This is how Doctrine's ORM handles the issue.

This allows individual fields to be queried directly as they are not JSON encoded.

For example:

type Company {
  id: ID! @unique
  name: String!
  address: Address
}

type Address {
  street: String
  city: String
  country: String
}

This would create a single table for the Company type and would map to the columns:

  • id
  • name
  • address_street
  • address_city
  • address_country

It's a simple, effective, proven solution and solves a lot of the problems mentioned above.

I agree with @ryall. A further point: I don't think embedded GraphQL types match the typical use case of JSON in relational DBs, because we are still dealing with a pre-defined schema. JSON in relational DBs would be better suited to a JSON scalar type in the GraphQL schema, where the shape of the data isn't known by the GraphQL schema or the DB schema.
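
A short sketch of that distinction, assuming Prisma's Json scalar type is available: the shape of meta is opaque to both the GraphQL and database schema, while Address remains fully typed:

type Company {
  id: ID! @unique
  name: String!
  address: Address # typed: shape is known and validated by the schema
  meta: Json       # free-form: shape is unknown to GraphQL and the DB
}

type Address {
  street: String
  city: String
  country: String
}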

Thanks for the pointer to Embeddables @ryall! It turns out Hibernate (as specified by JPA) supports the same concept and also explodes the fields of the embedded type into columns on the parent.

This implementation would make embedded types very different between the SQL connectors and the Mongo connector. Unification of the two is the main reason I think the Json type might be interesting. Keep in mind that Prisma can enforce the schema when inserting data.

As an example the following data model would be supported by the Mongo connector:

type Listing {
  name: String
  details: ListingDetail @relation(storage: Embedded)
  comments: [ListingComment] @relation(storage: Embedded)
}

type ListingDetail {
  price: Int
}

type ListingComment {
  user: String
  text: String
}

This would result in a single collection of documents looking like this:

{
  name: "used bike",
  details: {
    price: 35000
  },
  comments: [
    {
      user: "Søren",
      text: "I love this bike!"
    },
    {
      user: "Karl",
      text: "The color is great"
    }
  ]
}

and support queries like this:

{
  listings(where: { comments_any: { user: "Søren" } }) {
    name
    details {
      price
    }
  }
}

The method used by Doctrine and Hibernate would support the details field but not comments:

create table Listing (
    name text,
    details_price int
    -- no way to represent the comments array
);

If instead we use a JSON field, we would be able to support the same data model and queries. Furthermore, both MySQL and Postgres support creating indexes on paths in JSON by extracting to a generated column: https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html.

@sorenbs Insertion can be enforced, but wouldn't using JSON complicate migration?

@sorenbs To me an embeddable is still a first-class citizen on the parent object, and so it should be treated in every way as simply a property subset of the parent object, with the only variation being a different access path. It is just a naming scheme, nothing more. Embeddables still hold core data that needs to be queried and filtered, and with JSON encoding this is going to be either extremely hacky or extremely restricted.

Many DBs are going to need a flattened solution and others will be able to work with a true embedded solution. This could possibly be abstracted out into two path management schemes that can be reused with various connectors. One uses the path directly address.street and the other flattens into a unique column address_street. I believe most, if not all, document databases that support embedded objects can access the data with dot-notation. Other than that I don't see any major differences between the two.

I updated my comment above to include an example of a data model that is trivial to model with the Mongo connector, but would not be possible with the traditional approach.

@frankdugan3 - you are right about migrations.

@ryall - I agree that some databases do not have first-class support for hierarchical data. But many relational databases now have excellent support for hierarchical data with their JSON implementations. I'll go through all major databases and compile a list of their support.

I think the syntax for configuring embedded types should rather live on the type itself. In this case the initial example would look something like this:

type Company {
  id: ID! @unique
  name: String!
  address: Address 
}

type Address @embedded {
  street: String
  city: String
  country: String
}

Why?

  • All over the issue we are talking about types that are embedded. We are not talking about specific relations that are embedded or not. This hints that the configuration should be tied to the type.
  • Not marking the type itself as embedded would mean that types could be used in embedded and normal relations at the same time. However, this does not make sense. If a type is used in an embedded relation, it does not have an identity of its own (it can only be queried through the parent). This means it may not have a unique field. If this is the case, a normal type is not distinguishable from an embedded type in our API, because:

    • It cannot be queried at the top level itself.

    • You cannot use nested mutation features like connect and disconnect, which require that the related type has a unique field.

The motivation was probably that the developer should have the flexibility to decide whether a type is stored in the embedded or the normal way. One would then have to introduce separate types to achieve this:

type EmbeddedAddress @embedded {
  street: String
  city: String
  country: String
}
type Address {
  id: ID! @unique
  street: String
  city: String
  country: String
  company: Company!
}

@mavilein Agreed, the embedded type being marked makes a lot of sense.

@sorenbs OK with the example I see your point a bit more.

Then maybe list embeddables would need to remain in a separate table, which is still efficient and searchable. My main concern for this arises from making lots of unnecessary one-to-one joins on multiple embedded sub types anyway.

Alternatively, list-type embeddables could be simply disabled in databases that don't support it. I'm all for ORM structures but sometimes trying to make different database technologies function identically just comes with hacks and headaches. Better to know the subset your chosen database supports and use related features. Again it isn't uncommon to find ORMs which have differences per database because you simply can't please them all with one solution.

Additionally, there could be an @json directive to allow JSON encoding of one-to-one embeddables, but the developer should be made aware that this then prevents querying and filtering on those fields. It's an edge case to allow developers who really don't want any joins to have proper embedded types in SQL.
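
A sketch of how such an opt-in could look, with the @json directive being hypothetical:

type Company {
  id: ID! @unique
  name: String!
  billing: Address @json # JSON-encoded in a single column; no filtering
  visiting: Address      # flattened to visiting_street etc.; filterable
}

type Address @embedded {
  street: String
  city: String
  country: String
}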

@mavilein I also like a directive on the Type itself.

@ryall @sorenbs: Regarding the underlying database support for embedded types and querying/filtering. There's an interesting choice to be made here. If you want to have a GraphQL server that is storage agnostic, the capabilities of the underlying storage engine should have zero effect on the capabilities of the server. In other words, if querying and filtering is not supported by the storage engine, querying and filtering should happen in the GraphQL server. Yes, that would mean very inefficient data access, but I think there is a fundamental decision that needs to be made on this point.
Support for embedded types is not going to be the only difference we're gonna find when looking at different storage engines (a.k.a. underlying databases), and deciding at an early stage how much effect you want these differences to have on the functionality of your GraphQL server is very important.

On this point, I fundamentally disagree with what @sorenbs stated above. I think you _should_ be able to run the exact same service on different connectors. For this, the specific directives/instructions you will be able to provide in your GraphQL schema to affect the database side might not have any effect on some storage engines, but that's okay. Maybe the deployment logging should clearly show per type how it processed those instructions for the specific storage engine selected. You might get something like:

Type Address: Type is marked as embedded, but the selected connector does not support embedded types, or:
Type Post: Multi-column index specified, but the selected connector does not support multi-column indices. Using default index

Thanks for all the opinions and the thoughtful discussion in here. As we are currently implementing the Mongo connector we had to make some decisions. So we are going to have the @embedded directive on types as @mavilein proposed.

Additionally, we identified a need for nested updateMany/deleteMany mutations, since otherwise to-many embedded types could not be addressed. This is our current idea of what these would look like:

# toOne datamodel

type Parent {
  id: ID! @unique
  name: String!
  child: Child
}

type Child @embedded {
  name: String!
}

# create -> overwrites old child if there is one
mutation {
  updateParent(
    where: { id: "1" }
    data: { child: { create: { name: "Paul" } } }
  ) {
    id
  }
}

# update -> updates the given fields on the only child
mutation {
  updateParent(
    where: { id: "1" }
    data: { child: { update: { name: "Peter" } } }
  ) {
    id
  }
}

# delete -> deletes the only child
mutation {
  updateParent(where: { id: "1" }, data: { child: { delete: true } }) {
    id
  }
}
# toMany datamodel

type Parent {
  id: ID! @unique
  name: String!
  children: [Child!]!
}

type Child @embedded {
  name: String!
}

# create -> adds to existing children
mutation {
  updateParent(
    where: { id: "1" }
    data: { children: { create: [{ name: "Paul" }, { name: "Peter" }] } }
  ) {
    id
  }
}

# update -> updates all children hit by filter, empty filter hits all
mutation {
  updateParent(
    where: { id: "1" }
    data: {
      children: {
        updateMany: { where: { name: "Peter" }, data: { name: "Spidey" } }
      }
    }
  ) {
    id
  }
}

# delete -> deletes all children hit by filter, empty filter hits all
mutation {
  updateParent(
    where: { id: "1" }
    data: { children: { deleteMany: { where: { name: "Peter" } } } }
  ) {
    id
  }
}

These would not be limited to embedded types but would also be available in nested mutations for normal types.
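
For example, a hypothetical sketch with made-up Blog and Post types shows that a normal, non-embedded relation would use exactly the same shape:

# updateMany on a normal (non-embedded) relation
mutation {
  updateBlog(
    where: { id: "1" }
    data: {
      posts: {
        updateMany: { where: { published: false }, data: { published: true } }
      }
    }
  ) {
    id
  }
}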

Let us know if you have concerns or ideas around this!

@do4gr The format outlined looks very good, and I am excited about the addition of "updateMany" and "deleteMany" nested mutations in general. I can see other use cases that these fulfill beyond embedded types. Thank you for following up on this topic!

replace mutation

The current API is great for changing a subset of a document, but it is pretty annoying to replace an entire document. With the introduction of embedded types we will need a way to override an entire document including all embedded types.

Given the schema from above:

type Parent {
  id: ID! @unique
  name: String!
  child: Child
}

type Child @embedded {
  name: String!
}

The object representation of a full Parent looks like this:

{
  id: "1",
  name: "Karl",
  child: {
    name: "Caroline"
  }
}

With the current API, in order to override the entire document you would need to perform the following nested mutation:

mutation {
  updateParent(
    where: { id: "1" }
    data: {
      name: "Karl 2"
      child: {
        update: { name: "Caroline 2" }
      }
    }
  ) {
    id
  }
}

The issue here is that you cannot just pass the updated JSON object as you need to inject an update field for each embedded type.

MongoDB has a dedicated operation for this case called replaceOne. We could introduce a similar top-level mutation:

mutation {
  replaceParent(
    where: { id: "1" }
    data: {
      id: "1"
      name: "Karl 2"
      child: {
        name: "Caroline 2"
      }
    }
  ) {
    id
  }
}

A fairly common way to work with documents in MongoDB is to load an entire document into memory, change multiple fields in memory and then override the entire document. Introducing the replace mutation would facilitate this use case.
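
As a sketch of that flow using the proposed replace mutation: read the full document, modify it in application code, then write it back wholesale:

# 1. Read the entire document
{
  parent(where: { id: "1" }) {
    id
    name
    child {
      name
    }
  }
}

# 2. Write the modified document back in one operation
mutation {
  replaceParent(
    where: { id: "1" }
    data: { id: "1", name: "Karl 2", child: { name: "Caroline 2" } }
  ) {
    id
  }
}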

Dealing with relations

The above example illustrates how the replace mutation would work with scalar fields and embedded types. It gets more complicated when we take relations into account. If we extend the data model from before to include a relation on the parent type like this:

type Parent {
  id: ID! @unique
  name: String!
  child: Child
  other: Other
}

type Other {
  id: ID! @unique
  name: String!
}

type Child @embedded {
  name: String!
}

The following mutation would still update all the fields in Parent and Child, but it would break the relation to Other:

mutation {
  replaceParent(
    where: { id: "1" }
    data: {
      id: "1"
      name: "Karl 2"
      child: {
        name: "Caroline 2"
      }
    }
  ) {
    id
  }
}

The replace mutation only operates on a single document (i.e. a node and all its embedded types), so the following is not allowed:

mutation {
  replaceParent(
    where: { id: "1" }
    data: {
      id: "1"
      name: "Karl 2"
      child: {
        name: "Caroline 2"
      }
      other: {
        id: "other-id"
        name: "Other name 2"
      }
    }
  ) {
    id
  }
}

Proposal: introduce special syntax to enable replacing an entire document without breaking relations

If we allow setting just the id of a related node, it would be possible to query the entire document, including the id of related nodes and use the replace mutation without breaking relations:

Query

{
  parent(where: { id: "1" }) {
    id
    name
    child {
      name
    }
    other {
      id
    }
  }
}

returns

{
  id: "1",
  name: "Karl",
  child: {
    name: "Caroline"
  },
  other: {
    id: "other-id"
  }
}

Mutation

mutation {
  replaceParent(
    where: { id: "1" }
    data: {
      id: "1"
      name: "Karl 2"
      child: {
        name: "Caroline 2"
      }
      other: {
        id: "other-id"
      }
    }
  ) {
    id
  }
}

Open Questions

  • [ ] How to handle to-many relations?
  • [ ] Would it be better to disallow replace mutations for documents with relations?
  • [ ] Would it be better to always maintain existing relations without requiring related ids to be explicitly provided?

Does Mongo specifically allow inserting and deleting of embedded objects natively or is this something Prisma is trying to add on top? It currently feels unnatural to me how this is being implemented.

If I embed something, I classify it as being part of the base object, with the same rules and restrictions that come with it. If I want to add/remove children, I use a relation. Adding/removing from an embeddable feels very weird.

@kbrandwijk I don't think it's at all unreasonable to have per-database restrictions. Trying to make a 100% storage-agnostic layer is just going to cause headaches for the Prisma devs and unnecessary restrictions for users of the lib. If that approach is followed, Prisma will only ever be as good as the worst database implementation, and I don't think the argument for doing it inefficiently in code is really valid. If it can't be done efficiently, the user should reconsider their design.

Developers make choices all the time, and we can base our design around the backend we are using. Changing your backend in production is a massive edge case, and if it is necessary, it should require more than changing a config string. Again, referring to other ORMs, I don't think I've ever seen any that are 100% backend agnostic. It's a feature that causes problems and offers little value.

That being said, most of the implementation CAN and SHOULD be agnostic. I just disagree with including the edge cases in this.

@ryall Mongo does natively allow for insertion and deletion even in deeply nested embedded structures, using the positional operator: https://docs.mongodb.com/manual/reference/operator/update/positional-all/. It is just a little unwieldy to write these queries in Mongo, which I think only increases the advantage of using the Prisma API to do so.

Very excited to see this feature coming!

Just wanted to second @sorenbs's proposal for some syntax for updating an entire doc at once.

In a document-oriented database like Mongo, subdocuments are often used more to structure the data than as true child documents:

{
  name: {
    first: "Joe",
    last: "Schmo",
  },
  address: {
    street: "5 Main Street",
    city...
  }
}

In this case, having to use a special syntax to update the subfields feels quite awkward, when all you want to do is:

mutation {
  updateUser(
    where: { id: "1" }
    data: {
      address: {
        street: "5 Main Street"
      }
    }
  ) {
    id
  }
}

Absolutely. What I'd suggest is that this is separated into two concepts: keep @embedded for simple flat structures and use something like @list for complex one-to-many sub-structures. This can then be optimised for both SQL and NoSQL without conflict and can simplify the syntax considerably for many use cases.
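
A sketch of that split, reusing the Listing example from above; the @list directive is hypothetical:

type Listing {
  name: String
  details: ListingDetail       # simple flat one-to-one structure
  comments: [ListingComment!]! # complex one-to-many sub-structure
}

type ListingDetail @embedded {
  price: Int
}

type ListingComment @list {
  user: String
  text: String
}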

This all sounds pretty exciting (:

Is there any rough timeline when embedded types can be tested?

You can try them in the Mongo Connector preview at the moment. https://github.com/prisma/Mongo-Connector-Preview We do not have a concrete timeline for when they will come to SQL connectors yet.

Thanks a lot @do4gr! We will definitely check this out. Also: will it be possible to manually control the order of how the embedded items are returned?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 10 days if no further activity occurs. Thank you for your contributions.

One question:

Now with datamodel v1.1, embedded types do not exist any more. Is that correct?
