Pipelines: PipelineParam only supports string value

Created on 7 Oct 2019 · 13 Comments · Source: kubeflow/pipelines

What happened:
Even if I define a pipeline parameter as int or float, it's still serialized as a string when it's passed to a component. It seems PipelineParam currently only supports string value: https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/_pipeline_param.py#L144-L145.

It would be nice if pipeline parameter values can support more primitive types such as int and float.

Labels: area/sdk/dsl, lifecycle/stale

All 13 comments

Please tell us about your user scenario.
Users should not be using the PipelineParam class directly. Technically it should have been a private class.

PipelineParam.value was mostly a way to specify default values when declaring the pipeline function. Using PipelineParam in the function signature is unneeded and deprecated.
PipelineParam.value will be deprecated and removed soon.

The proper way to declare a pipeline function is:

def my_pipeline(
    my_string: str,
    my_num: int = 7,
):
    ...

@Ark-kun no, we are not using PipelineParam in the pipeline function signature. Users define parameters with normal Python primitive types. Using your example above:

def my_pipeline(
    string: str,
    num: int = 7,
):
    ...

Say that in this pipeline we define a TFX component, and num is passed to the component as a field in the hparams dictionary. I noticed that in the generated Argo workflow YAML file, the default value is serialized as a string in the spec:

spec:
 arguments:
   parameters:
   - name: num
     value: '7'

and in the section of arguments for the container:

- container:
  args:
  - --hparams
  - '{"num": "{{inputs.parameters.num}}"}'

In this case, we'll get the num value as a string at runtime. I wonder whether this is expected because of Argo, or whether we can do something to infer the type somewhere.

(NOTE: we are not using vanilla kfp dsl, it's also possible that we can do something on our end)
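
A minimal sketch of why the value arrives as a string, assuming the placeholder is filled in by plain text substitution. The `render` helper is hypothetical and only imitates that step; it is not Argo's or KFP's code.

```python
import json

# Argument template as it appears in the generated workflow.
ARG_TEMPLATE = '{"num": "{{inputs.parameters.num}}"}'


def render(template: str, params: dict) -> str:
    """Hypothetical stand-in for the workflow engine's textual substitution."""
    for name, value in params.items():
        template = template.replace("{{inputs.parameters.%s}}" % name, value)
    return template


# The parameter value itself is the string '7', as in the spec above.
rendered = render(ARG_TEMPLATE, {"num": "7"})
hparams = json.loads(rendered)

print(rendered)
print(type(hparams["num"]))
```

Because the quotes are part of the template, the substituted value stays a JSON string no matter what type the pipeline function declared.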

> In this case, we'll get num value as string at runtime.

Since Pipelines orchestrates containerized command-line programs, every piece of data that's passed between components becomes a string or a binary blob file at some point.
The only thing that can be passed to an arbitrary command-line program is an array of string arguments.

When using the Lightweight Python components feature (func_to_container_op), which creates a component from a function, a command-line wrapper is generated that serializes and deserializes the values.

In all other cases, this needs to be performed by the command-line program.
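
For a hand-written component, that deserialization could look like the sketch below. The `--num` and `--hparams` flags and the sample values are illustrative names, not part of any KFP or TFX interface.

```python
import argparse
import json


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # argparse converts the raw string argument to an int.
    parser.add_argument("--num", type=int, default=7)
    # A JSON document is passed as a single string and decoded here.
    parser.add_argument("--hparams", type=json.loads, default={})
    return parser.parse_args(argv)


# Every command-line argument arrives as a string.
args = parse_args(["--num", "7", "--hparams", '{"label_key": "tips"}'])

print(args.num + 1)
print(args.hparams)
```

In a real component the list would come from `sys.argv[1:]`; it is passed explicitly here so the sketch runs on its own.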

> we define a tfx component, and num is passed to the component as a field in hparam dictionary.

How is that hparams dictionary defined?

> '{"num": "{{inputs.parameters.num}}"}'

Whoever serializes this dictionary could technically remove quotes here, but it would be problematic if the value contains spaces or something else.
It's non-trivial to "deserialize" a value that does not exist yet. E.g., int("{{inputs.parameters.num}}") does not work.

Generally, passing structures containing output references (as opposed to single output references) is not officially supported. It works in many cases, but it might be impossible or broken in other cases. For example, if the passed data contains quotes or special characters, the JSON format will be broken and we cannot prevent that.
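
To illustrate both failure modes, here is a small sketch under the assumption of plain textual substitution. The template is the one quoted above; the substituted value is made up.

```python
import json

PLACEHOLDER = "{{inputs.parameters.num}}"
template = '{"num": "%s"}' % PLACEHOLDER

# 1. At compile time only the placeholder exists, so it cannot be converted.
try:
    int(PLACEHOLDER)
except ValueError as err:
    compile_time_error = err
    print("compile time:", err)

# 2. At run time, a value containing a double quote corrupts the JSON text.
rendered = template.replace(PLACEHOLDER, 'say "hi"')
try:
    json.loads(rendered)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print("run time:", err)
```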

Can you tell a bit more about your case?

Thanks for the detailed explanation!

> How is that hparams dictionary defined?

In our case, hparams is defined as a regular dictionary with type Dict[str, Any]. We have our own way to define a component, similar to kfp's, e.g.:

# def my_pipeline(label_key: str, hidden_layers: int = 7)
Trainer.op(
  pipeline_options,
  transformed_examples=transform.outputs["transformed_examples"],
  transform_output=transform.outputs["transform_output"],
  module_file=trainer,
  # hparams is a plain Dict[str, Any] built from the pipeline parameters
  hparams=dict(hidden_layers=hidden_layers, label_key=label_key),
)

After that, our library calls kfp's _create_task_factory_from_component_spec to get a container factory function, which we use to create a container op. It seems that the hparams value gets serialized by this line:
https://github.com/kubeflow/pipelines/blob/646c2890de8c63eb324c8827d82645fedca1984a/sdk/python/kfp/components/_components.py#L237. That's where {{inputs.parameters.num}} gets the quotes.

As you mentioned earlier, it does seem to be non-trivial to come up with a perfect solution. To be clear, it's not a blocker for us. Our current workaround is to add a deserializer in the trainer component that parses the hparams values, and it works fine.
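
Such a deserializer could be as small as the sketch below. `parse_hparams` is a hypothetical helper, not the one used in this thread, and the sample keys are made up. It assumes every value arrives as a string and tries to decode each one as JSON, keeping the original string when decoding fails.

```python
import json
from typing import Any, Dict


def parse_hparams(raw: str) -> Dict[str, Any]:
    """Decode the hparams JSON and restore primitive types of its values."""
    decoded = {}
    for key, value in json.loads(raw).items():
        if isinstance(value, str):
            try:
                # Numeric and boolean text decodes to the matching Python type.
                value = json.loads(value)
            except json.JSONDecodeError:
                # Plain text is not valid JSON on its own; keep it as a string.
                pass
        decoded[key] = value
    return decoded


hparams = parse_hparams('{"hidden_layers": "7", "lr": "0.1", "label_key": "tips"}')
print(hparams)
```

One caveat of this approach: a value that is meant to stay a string but happens to look like a number (for example "123") is converted too, so a per-key schema may be safer when that matters.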

Another workaround might be to make a custom dict serializer that performs json.dumps and removes quotes around "{{pipelineparam..}}" and pass the string instead of the dictionary (strings are considered to already be serialized).
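
A sketch of that workaround, assuming placeholders have the `{{...}}` shape seen in the generated YAML. `dumps_unquoting_placeholders` is a hypothetical name, and the regular expression is an assumption about that shape rather than anything kfp defines.

```python
import json
import re

# Matches a JSON string whose entire content is a single {{...}} placeholder.
_QUOTED_PLACEHOLDER = re.compile(r'"(\{\{[^{}"]+\}\})"')


def dumps_unquoting_placeholders(data: dict) -> str:
    """Run json.dumps, then drop the quotes around bare placeholders."""
    return _QUOTED_PLACEHOLDER.sub(r"\1", json.dumps(data))


serialized = dumps_unquoting_placeholders(
    {"hidden_layers": "{{inputs.parameters.hidden_layers}}", "label_key": "tips"}
)
print(serialized)

# After substitution with a numeric value the text parses with an int field.
at_runtime = serialized.replace("{{inputs.parameters.hidden_layers}}", "7")
print(json.loads(at_runtime))
```

Unquoting is only safe for parameters whose values are valid JSON on their own, such as numbers or booleans. A string-valued parameter must keep its quotes, which is the problem with spaces and special characters mentioned earlier in the thread.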

This is an issue for built-ins like ResourceOp.
Example:

kfp.dsl.ResourceOp(
    name="scale-deployment",
    k8s_resource=kubernetes.client.V1Deployment(
        spec={"replicas": replicas}, **deployment_definition
    ),
    action="patch",
    merge_strategy="merge",
)

This fails because replicas is a PipelineParam and I cannot see a way to force it to be an int. Type annotations didn't seem to work.

> This fails because replicas is a PipelineParam and I cannot see a way to force it to be an int. Type annotations didn't seem to work.

I guess the problem is hard to solve for strongly-typed structures.
You can instead create a generic Patch component that does not use ResourceOp. That way you can pass in the data without problems.

@Ark-kun, related to @grischa's comment above, I am trying to use pipeline params as the k8s_resource in ResourceOp. I am generating the resource definition in a ContainerOp and retrieving it using op_name.outputs["resource_definition"]. When I try to call json.loads on the output string to pass it to ResourceOp, I get the error: TypeError: the JSON object must be str, bytes or bytearray, not PipelineParam. Is there any way around this?

I have tried using both NamedTuple("Output", [("resource_name", str)]) and NamedTuple("Output", [("resource_name", dict)]).

@ssharpe42 ResourceOp (also pvolume, VolumeOp) are community contributions and may have bugs. It's pretty hard for our team to provide support for it.

> When I try to call json.loads on the output string to pass it to ResourceOp I get the error: TypeError: the JSON object must be str, bytes or bytearray, not PipelineParam.

When and where (which machine) do you expect that json.loads call to be executed?

P.S. Have you tried simply passing ResourceOp(k8s_resource=str(op_name.outputs["resource_definition"]))?
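
The question matters because the pipeline function body runs at compile time, on the machine that builds the workflow, when step outputs do not exist yet. A rough sketch with a stand-in class follows. `FakeParam` and its placeholder format are hypothetical and only imitate how an output reference behaves; this is not kfp's PipelineParam.

```python
import json


class FakeParam:
    """Hypothetical stand-in for an output reference known only at run time."""

    def __init__(self, op_name: str, name: str):
        self.op_name = op_name
        self.name = name

    def __str__(self) -> str:
        # Assumed placeholder format, for illustration only.
        return "{{%s.outputs.%s}}" % (self.op_name, self.name)


ref = FakeParam("make-resource", "resource_definition")

try:
    # The reference is not a str: there is no JSON text to parse yet.
    json.loads(ref)
except TypeError as err:
    error_message = str(err)
    print(error_message)

# str() yields a string, but it is still only a placeholder, not the data.
placeholder = str(ref)
print(placeholder)
```

Any parsing of the real value therefore has to happen inside a step that runs after the producing step has finished.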

@Ark-kun,
Please feel free to point out bugs in that part of the DSL and assign me to them when they appear.
As far as pvolumes are concerned, I doubt that they have bugs that are not hidden in add_volume, add_volume_mount, or the Kubernetes client (while on_prem.mount_pvc is error-prone as mentioned here).
Of course, I encourage users to break them so that we all learn :smile:

@ssharpe42,
Currently, a ResourceOp accepts only dictionaries or Kubernetes objects in the k8s_resource constructor argument. Then, k8s_resource gets json.dump'd into the manifest attribute.
We could make manifest available as well, and I'm confident that this issue can be resolved.

A hacky workaround would be to define a random rop = ResourceOp(...) with some dummy (but valid) k8s_resource and then forcefully assign

rop.resource.manifest = str(op_name.outputs["resource_definition"])

Disclaimer: I haven't actually tried it hehe

So what is the suggested way to launch a pipeline with a large number of parameters (e.g., a neural network configuration)? I was going for the JSON-serialized approach, but it seems that is not the right way to do it. I would like the hparams to show up in the "compare runs" view, but it seems the only way to achieve that right now is to manually pass all the parameters from the pipeline down to the component.

Any other solution/suggestion/future plans?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
