I have a binary classification dataset with both categorical and numerical features, similar to the Titanic dataset. I have created an sklearn pipeline that preprocesses the data and then uses a RandomForest to classify it.
I am able to load the model in the C++ ONNX Runtime, but I do not understand how to prepare the input data for prediction.
The samples provided all deal with the tensor data format.
Could somebody share a link to examples of classical ML models and their input data preparation for prediction?
Are you planning to use the Python API for ORT or C++?
@faxu I need to use C++ APIs for ORT.
Have you looked at this:
I have looked into this but I am looking for some examples of traditional ML models on tabular data such as "Titanic".
Even for tabular data, you would have a vector with a specific shape (M x N), similar to input_tensor_values above. You could then use CreateTensorWithDataAsOrtValue() to create the input tensor from your vector, passing input_node_dims set to [1, M, N] and dim_len = 3.
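A minimal sketch of the buffer preparation described above, using only the standard library. The helper names `flatten_rows` and `element_count` are hypothetical and not part of ONNX Runtime. Whether the model expects dims of {M, N} or {1, M, N} depends on how it was exported, so the shape is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ONNX Runtime): copies an M x N table,
// stored as a vector of rows, into one contiguous row-major buffer.
std::vector<float> flatten_rows(const std::vector<std::vector<float>>& rows) {
  std::vector<float> flat;
  if (rows.empty()) return flat;
  const std::size_t n_cols = rows.front().size();
  flat.reserve(rows.size() * n_cols);
  for (const auto& row : rows) {
    assert(row.size() == n_cols);  // every row must have the same width
    flat.insert(flat.end(), row.begin(), row.end());
  }
  return flat;
}

// Number of elements implied by a shape such as {M, N} or {1, M, N}.
std::size_t element_count(const std::vector<int64_t>& dims) {
  std::size_t count = 1;
  for (int64_t d : dims) {
    assert(d >= 0);  // dynamic (-1) dimensions must be resolved first
    count *= static_cast<std::size_t>(d);
  }
  return count;
}
```

The flattened buffer's size should equal `element_count(dims)` for the dims passed when creating the tensor. Checking this before the call catches shape mismatches early.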
@prabhat00155 I am trying to use the Titanic dataset (https://www.kaggle.com/c/titanic/data?select=train.csv). It has columns of mixed types (int, float, string), which I handle in the model pipeline. In Python onnxruntime this is easier because it supports mixed types.
Is it possible to do this in C++? I have been trying but have not succeeded yet. If possible, could you point me to a similar reference?
I am trying to do something like this:
size_t input_tensor_size = 2 * 5; // simplify ... using known dim values to calculate size
// use OrtGetTensorShapeElementCount() to get official size!
std::vector<std::string> input_tensor_values = { "3", "female", "1.0", "15.7417", "C", "2", "male", "18.0", "13.0000"};
std::vector<const char*> output_node_names = { "softmaxout_1" };
// initialize input data with values in [0.0, 1.0]
/*for (unsigned int i = 0; i < input_tensor_size; i++)
input_tensor_values[i] = (float)i / (input_tensor_size + 1);*/
// create input tensor object from data values
auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
Ort::Value input_tensor = Ort::Value::CreateTensor<std::string>(memory_info, input_tensor_values.data(), input_tensor_size, input_node_dims.data(), 5);
assert(input_tensor.IsTensor());
But it gives a compilation error:
Error C2027 use of undefined type 'Ort::TypeToTensorType
Do we need to use
inline Value Value::CreateMap(Value& keys, Value& values) {
OrtValue* out;
OrtValue* inputs[2] = {keys, values};
ThrowOnError(Global<void>::api_.CreateValue(inputs, 2, ONNX_TYPE_MAP, &out));
return Value{out};
}
for this case?
@prabhat00155 @faxu Please provide some guidance in this issue.
I see, your challenge is supporting mixed type in your ONNX model.
Let's take an example: a data frame with int and float features:
df = pd.DataFrame({'age': [23, 24, 30, 50, 70],
'balance': [100.23, 123.43, 111.11, 123.09, 222.2]})
labels = np.array([0, 1, 1, 0, 1])
model = Pipeline([('ct', ColumnTransformer([('scaler', StandardScaler(), ['age'])], remainder='passthrough')),
('mlp', MLPClassifier())])
model.fit(df, labels)
To convert this to ONNX, I would do the following (setting two inputs):
onnx_fs = convert_sklearn(model, 'mlp',
[('age', Int64TensorType([None, 1])),
('balance', FloatTensorType([None, 1]))])
save_model(onnx_fs, 'mlp.onnx')
To score this model in Python, you could do this:
sess = InferenceSession('mlp.onnx')
res = sess.run('', input_feed={'age': df['age'].values.reshape((-1, 1)).astype(np.int64),
'balance': df['balance'].values.reshape((-1, 1)).astype(np.float32)})
In C++, we need to create separate tensors for the inputs and put them all in an array before passing the array to Run().
std::vector<int64_t> age = {23, 24, 30, 50, 70};
std::vector<float> balance = {100.23, 123.43, 111.11, 123.09, 222.2};
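std::vector<const char*> input_node_names = {"age", "balance"};  // input names passed to convert_sklearn
std::vector<const char*> output_node_names = {"output_label", "output_probability"};
std::vector<int64_t> input_dims = {5, 1};  // 5 rows, 1 value per row for each input
auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);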
Ort::Value age_tensor = Ort::Value::CreateTensor<int64_t>(memory_info, age.data(), age.size(), input_dims.data(), 2);
Ort::Value balance_tensor = Ort::Value::CreateTensor<float>(memory_info, balance.data(), balance.size(), input_dims.data(), 2);
Ort::Value input_tensor[] = {std::move(age_tensor), std::move(balance_tensor)};
auto output_tensors = session.Run(Ort::RunOptions{nullptr}, input_node_names.data(), input_tensor, 2, output_node_names.data(), 2);
If you have string features, you'll have to use the C API FillStringTensor().
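Since each model input needs its own contiguous buffer, row-oriented records have to be split into columns before the tensors are created. The following is a rough sketch of that step, assuming the 'age' and 'balance' inputs from the example above. The `Record` and `ColumnBuffers` types and the `to_columns` function are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical row type for illustration; the fields mirror the
// 'age' and 'balance' inputs used in the example above.
struct Record {
  int64_t age;
  float balance;
};

// One contiguous buffer per model input, plus the shared {rows, 1} shape.
struct ColumnBuffers {
  std::vector<int64_t> age;
  std::vector<float> balance;
  std::vector<int64_t> dims;  // shape to pass when creating each column tensor
};

// Splits row-oriented records into one typed buffer per column.
ColumnBuffers to_columns(const std::vector<Record>& records) {
  ColumnBuffers cols;
  cols.age.reserve(records.size());
  cols.balance.reserve(records.size());
  for (const Record& r : records) {
    cols.age.push_back(r.age);
    cols.balance.push_back(r.balance);
  }
  assert(cols.age.size() == cols.balance.size());  // columns stay aligned
  cols.dims = {static_cast<int64_t>(records.size()), 1};
  return cols;
}
```

Each buffer (`cols.age`, `cols.balance`) can then be handed to the typed CreateTensor call for its input, with `cols.dims` as the shape. The buffers must stay alive until Run() has returned, because the tensors created this way reference the data rather than copying it.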
I am using the following API to convert strings to a tensor:
std::vector<std::string> input_tensor_values_sex = { "male" , "female", "male"};
Ort::Value input_tensor_sex = Ort::Value::CreateTensor(memory_info, input_tensor_values_sex.data(), input_tensor_size, input_dim.data(), 2, ONNXTensorElementDataType::ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING);
assert(input_tensor_sex.IsTensor());
But it does not seem to convert the data to a tensor correctly. The CreateTensor() API asks for "p_data_byte_count" as its third parameter, and I think I am passing it the wrong value. How do I get the byte count for an array of strings? Is it the sum of the bytes of the individual elements?
If this API cannot do the job, FillStringTensor() also asks for s_len. What is s_len for an array of strings?
Here is an example where I use string features as well:
model = Pipeline([('ct', ColumnTransformer([('ohe', OneHotEncoder(), ['gender'])], remainder='passthrough')),
('mlp', MLPClassifier())])
model.fit(df, labels)
onnx_fs = convert_sklearn(model, 'mlp',
[('age', Int64TensorType([None, 1])),
('gender', StringTensorType([None, 1])),
('balance', FloatTensorType([None, 1]))])
save_model(onnx_fs, 'mlp.onnx')
Here is how I would do inferencing in C++:
std::vector<int64_t> age = {23, 24, 30, 50, 70};
const char* gender[5] = {"male", "female", "male", "female", "male"};
std::vector<float> balance = {100.23, 123.43, 111.11, 123.09, 222.2};
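std::vector<const char*> input_node_names = {"age", "gender", "balance"};  // input names passed to convert_sklearn
std::vector<const char*> output_node_names = {"output_label", "output_probability"};
std::vector<int64_t> input_dims = {5, 1};
Ort::AllocatorWithDefaultOptions allocator;  // used by the string-tensor CreateTensor overload below
auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);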
Ort::Value age_tensor = Ort::Value::CreateTensor<int64_t>(memory_info, age.data(), age.size(), input_dims.data(), 2);
Ort::Value gender_tensor = Ort::Value::CreateTensor(allocator, input_dims.data(), input_dims.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING);
Ort::GetApi().FillStringTensor(static_cast<OrtValue*>(gender_tensor), gender, 5U);
Ort::Value balance_tensor = Ort::Value::CreateTensor<float>(memory_info, balance.data(), balance.size(), input_dims.data(), 2);
Ort::Value input_tensor[] = {std::move(age_tensor), std::move(gender_tensor), std::move(balance_tensor)};
// score model & input tensor, get back output tensor
auto output_tensors = session.Run(Ort::RunOptions{nullptr}, input_node_names.data(), input_tensor, 3, output_node_names.data(), 2);
assert(output_tensors.size() == 2 && output_tensors.front().IsTensor());
int64_t* labels = output_tensors[0].GetTensorMutableData<int64_t>();
for (int i = 0; i < 5; i++)
{
std::cout << labels[i] << "\t";
}
FillStringTensor() takes the number of string elements as its final argument (5 here, since the array has five entries). It is a count of strings, not a byte count.
std::vector<int64_t> input_dims = {5, 1};
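If the strings live in a std::vector<std::string>, as in the earlier attempt, they first need to be exposed as an array of C-string pointers, because FillStringTensor() takes a `const char* const*`. Here is a small sketch of that conversion; `to_c_strings` is a hypothetical helper name, not an ONNX Runtime function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds the array of C-string pointers that
// FillStringTensor() expects from a vector of std::string values.
// The returned pointers borrow from `values`, so `values` must outlive them
// and must not be modified while the pointers are in use.
std::vector<const char*> to_c_strings(const std::vector<std::string>& values) {
  std::vector<const char*> ptrs;
  ptrs.reserve(values.size());
  for (const std::string& s : values) {
    ptrs.push_back(s.c_str());
  }
  assert(ptrs.size() == values.size());
  return ptrs;
}
```

With this, `ptrs.data()` and `ptrs.size()` would be the second and third arguments to FillStringTensor(), so s_len is simply the number of strings.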
Thank you for your help.
I did what you suggested, i.e. I generated the data as:
df = pd.DataFrame({'age': [23, 24, 30, 50, 70], 'gender': ["male", "female", "male", "female", "male"],
'balance': [100.23, 123.43, 111.11, 123.09, 222.2]})
But when I try to convert it to ONNX format like this:
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType, StringTensorType, Int64TensorType
from winmltools.utils import save_model
onnx_fs = convert_sklearn(model, 'mlp',
[('age', Int64TensorType([None, 1])),
('gender', StringTensorType([None, 1])),
('balance', FloatTensorType([None, 1]))])
save_model(onnx_fs, 'mlp_string.onnx')
it gives this error:
RuntimeError: Columns must have the same type. C++ backends do not support mixed types. Inputs:
Variable(raw_name='age', onnx_name='age', type=Int64TensorType(shape=[None, 1]))
Variable(raw_name='balance', onnx_name='balance', type=FloatTensorType(shape=[None, 1]))
What is the mistake in the code here? Do I need to handle it differently in C++?
This issue is caused by your skl2onnx version. You won't see it if you build skl2onnx from master (or wait for the next skl2onnx release). How did you create your Titanic model? I think it has multiple inputs. You could use the C++ inferencing code I posted to run inference on that model.
Inferencing on the Titanic ONNX model you shared:
const char* pclass[5] = {"1", "2", "1", "2", "3"};
const char* sex[5] = {"male", "female", "male", "female", "male"};
std::vector<float> age = {23, 24, 30, 50, 70};
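std::vector<float> fare = {110.23, 13.43, 113.11, 13.09, 5.2};
const char* embarked[5] = {"S", "C", "S", "S", "S"};
std::vector<const char*> output_node_names = {"label", "probabilities"};
std::vector<int64_t> input_dims = {5, 1};
Ort::AllocatorWithDefaultOptions allocator;  // used by the string-tensor CreateTensor overload below
// input_node_names (not shown) must list the model's input names in the same order as the tensors below
auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);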
Ort::Value pclass_tensor = Ort::Value::CreateTensor(allocator, input_dims.data(), input_dims.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING);
Ort::GetApi().FillStringTensor(static_cast<OrtValue*>(pclass_tensor), pclass, 5U);
Ort::Value gender_tensor = Ort::Value::CreateTensor(allocator, input_dims.data(), input_dims.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING);
Ort::GetApi().FillStringTensor(static_cast<OrtValue*>(gender_tensor), sex, 5U);
Ort::Value age_tensor = Ort::Value::CreateTensor<float>(memory_info, age.data(), age.size(), input_dims.data(), 2);
Ort::Value fare_tensor = Ort::Value::CreateTensor<float>(memory_info, fare.data(), fare.size(), input_dims.data(), 2);
Ort::Value embarked_tensor = Ort::Value::CreateTensor(allocator, input_dims.data(), input_dims.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_STRING);
Ort::GetApi().FillStringTensor(static_cast<OrtValue*>(embarked_tensor), embarked, 5U);
Ort::Value input_tensor[] = {std::move(pclass_tensor), std::move(gender_tensor), std::move(age_tensor), std::move(fare_tensor), std::move(embarked_tensor)};
// score model & input tensor, get back output tensor
auto output_tensors = session.Run(Ort::RunOptions{nullptr}, input_node_names.data(), input_tensor, 5, output_node_names.data(), 2);
assert(output_tensors.size() == 2 && output_tensors.front().IsTensor());
int64_t* labels = output_tensors[0].GetTensorMutableData<int64_t>();
for (int i = 0; i < 5; i++)
{
std::cout << labels[i] << std::endl;
}
My code for the model is:
titanic_url = ('https://raw.githubusercontent.com/amueller/'
'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)
X = data.drop('survived', axis=1)
y = data['survived']
# SimpleImputer on strings is not available
# in the ONNX-ML specification.
# So we do it beforehand.
for cat in ['embarked', 'sex', 'pclass']:
X[cat].fillna('missing', inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
# --- SimpleImputer is not available for strings in ONNX-ML specifications.
# ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
clf = Pipeline(steps=[('preprocessor', preprocessor),
('cls', XGBClassifier())])
update_registered_converter(
XGBClassifier, 'XGClassifier',
calculate_linear_classifier_output_shapes,
convert_xgb)
clf.fit(X_train, y_train)
And to convert the model to ONNX:
X_train['pclass'] = X_train['pclass'].astype(str)
X_test['pclass'] = X_test['pclass'].astype(str)
inputs = convert_dataframe_schema(X_train, to_drop)
model_onnx = convert_sklearn(clf, 'pipeline_titanic', inputs, target_opset=7)
# And save.
with open("pipeline_titanic2.onnx", "wb") as f:
f.write(model_onnx.SerializeToString())
The versions are:
onnx: 1.5.0
onnxruntime: 1.2.0
skl2onnx: 1.6.1
Please suggest any better approach.
This looks fine. It gives you an ONNX model, right? Try the code I pasted above (https://github.com/microsoft/onnxruntime/issues/3986#issuecomment-635894186), which I used on the pipeline_titanic.onnx model you shared with me. That should work. You may update the input values.
Also, you may want to upgrade your onnx and onnxruntime packages.
I have seen the example at https://github.com/microsoft/onnxruntime/blob/master/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp, but I still do not understand how to convert an image in an OpenCV Mat into a std::vector.
Is the layout CHW, CWH, or something else?
I am not a C++ expert, but I think you can try something like this (see the "load input image" part): https://mxnet.apache.org/api/cpp/docs/tutorials/cpp_inference.
Whether it's NCHW or NHWC would depend on the way you trained your model.
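For the case where the model expects NCHW input, here is a rough sketch of the layout conversion with no OpenCV dependency. It assumes the image is an interleaved 8-bit HWC buffer, which is how a continuous OpenCV Mat stores its pixels (with BGR channel order by default). The name `hwc_to_chw` and the scaling to [0, 1] are assumptions for illustration and should be adapted to the preprocessing used in training.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts an interleaved HWC byte image into a planar
// CHW float buffer. Pixel values are scaled to [0, 1]; drop or change the
// scaling to match the preprocessing the model was trained with.
std::vector<float> hwc_to_chw(const std::uint8_t* hwc, std::size_t height,
                              std::size_t width, std::size_t channels) {
  assert(hwc != nullptr);
  std::vector<float> chw(channels * height * width);
  for (std::size_t h = 0; h < height; ++h) {
    for (std::size_t w = 0; w < width; ++w) {
      for (std::size_t c = 0; c < channels; ++c) {
        const std::size_t src = (h * width + w) * channels + c;  // HWC index
        const std::size_t dst = (c * height + h) * width + w;    // CHW index
        chw[dst] = static_cast<float>(hwc[src]) / 255.0f;
      }
    }
  }
  return chw;
}
```

The resulting vector would be passed to the float CreateTensor call with dims {1, channels, height, width}. If the model was trained with NHWC input instead, the interleaved buffer only needs a conversion to float, with no reordering.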
@Karnav123 Can I close this issue, if you have no more questions?
Yes. Thank you so much for your help.