The pmmlTransformations R package can be used to transform data and add new features to be used in predictive PMML models.

In this blog post, we will focus on `FunctionXform`, a function introduced in version 1.3.0 of `pmmlTransformations`, and present a few examples of using it to create new data features.

## How it works

Transformations in the `pmmlTransformations` package work in the following manner: given a `WrapData` object and a transformation name, the code calculates data for a new feature and creates a new `WrapData` object. The data in this new object is then passed as the `data` argument when training an R model with a compatible R package, and the object itself is passed to `pmml::pmml()` through the `transform` argument. When PMML is produced, the transformation is inserted into the `LocalTransformations` node as a `DerivedField`. Any original fields used by transformations are added to the appropriate nodes in the resulting PMML file.

While other transformations in the package transform only one field, `FunctionXform` makes it possible to use multiple data fields and functions to produce a new feature.
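To make the idea concrete, here is a minimal base-R sketch of how a formula string that refers to several columns could be evaluated against a data frame to produce a new column. The helper `add_feature` and the `Sepal.Area` feature are hypothetical and are not taken from `pmmlTransformations`; the sketch only illustrates the kind of computation involved, not how `FunctionXform` is implemented.

```r
# Hypothetical helper (not part of pmmlTransformations): evaluate a formula
# string against the columns of a data frame and append the result.
add_feature <- function(df, new_name, formula_text) {
  expr <- parse(text = formula_text)[[1]]   # string -> unevaluated R call
  df[[new_name]] <- eval(expr, envir = df)  # column names resolve inside df
  df
}

data(iris)
iris2 <- add_feature(iris, "Sepal.Area", "Sepal.Length * Sepal.Width")
print(head(iris2[, c("Sepal.Length", "Sepal.Width", "Sepal.Area")], 3))
```

Unlike this sketch, `FunctionXform` also records the expression so that it can later be written into the PMML file.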

Note that while `FunctionXform` is part of the `pmmlTransformations` package, the code to produce PMML from R is in the `pmml` package. The following examples require both packages to be installed.

To make tables more readable in this blog post, we are using the `kable` function (part of `knitr`).

## Single numeric field

Using the `iris` dataset as an example, let’s construct a new numeric feature by transforming one variable.

First, load the required libraries:

```r
library(pmml)
library(pmmlTransformations)
library(knitr)
```

Then load the data and display the first 3 lines:

```r
data(iris)
kable(head(iris, 3))
```

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |

Create the `irisBox` wrapper object with `WrapData`:

```r
irisBox <- WrapData(iris)
```

`irisBox` contains the data and transform information that will be used to produce PMML later. The original data is in `irisBox$data`. Any new features created with a transformation are added as columns to this data frame.

```r
kable(head(irisBox$data, 3))
```

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |

Transform and field information is in `irisBox$fieldData`. The `fieldData` data frame contains information on every field in the dataset, as well as every transform used. The `functionXform` column contains expressions used in the `FunctionXform` transform. Here we’ll show only a few of the columns:

```r
kable(irisBox$fieldData[, 1:5])
```

| | type | dataType | origFieldName | sampleMin | sampleMax |
|---|---|---|---|---|---|
| Sepal.Length | original | numeric | NA | NA | NA |
| Sepal.Width | original | numeric | NA | NA | NA |
| Petal.Length | original | numeric | NA | NA | NA |
| Petal.Width | original | numeric | NA | NA | NA |
| Species | original | factor | NA | NA | NA |

Now add a new feature, `Sepal.Length.Sqrt`, using `FunctionXform`:

```r
irisBox <- FunctionXform(irisBox,
                         origFieldName = "Sepal.Length",
                         newFieldName = "Sepal.Length.Sqrt",
                         formulaText = "sqrt(Sepal.Length)")
```

The new feature is calculated and added as a column to the `irisBox$data` data frame:

```r
kable(head(irisBox$data, 3))
```

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Length.Sqrt |
|---|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.258318 |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa | 2.213594 |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.167948 |
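As a quick sanity check, the values in the new column can be reproduced in base R without the wrapper object. This snippet only verifies the arithmetic and does not use `pmmlTransformations`; the variable name `sqrt_check` is ours.

```r
data(iris)
# Square root of the first three Sepal.Length values, rounded to 6 decimals
sqrt_check <- round(sqrt(head(iris$Sepal.Length, 3)), 6)
print(sqrt_check)
```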

`irisBox$fieldData` now contains a new row with the transformation expression in the `functionXform` column:

```r
kable(irisBox$fieldData[6, c(1:3, 14)])
```

| | type | dataType | origFieldName | functionXform |
|---|---|---|---|---|
| Sepal.Length.Sqrt | derived | numeric | Sepal.Length | sqrt(Sepal.Length) |

Construct a linear model to predict `Petal.Width` using this new feature, and convert it to PMML:

```r
fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data = irisBox$data)
fit_pmml <- pmml(fit, transform = irisBox)
```

Since the model predicts `Petal.Width` using a variable based on `Sepal.Length`, `Sepal.Length` will be added to the `DataDictionary` and `MiningSchema` nodes in the resulting PMML. We can take a look at the relevant parts of the output like so:

```r
fit_pmml[[2]]       # DataDictionary node
fit_pmml[[3]][[1]]  # MiningSchema node
```

The `LocalTransformations` node contains `Sepal.Length.Sqrt` as a derived field:

```r
fit_pmml[[3]][[3]]  # LocalTransformations node
```

The PMML model can now be deployed and consumed. For any input data, the new `Sepal.Length.Sqrt` feature will be created when the data is scored against the model.
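The scoring behavior described here can be mimicked in plain R. The sketch below is an illustration under our own assumptions, not the PMML scoring path: the input values in `new_input` are made up, and the derived field is recomputed by hand, which is the step a PMML consumer performs from the `DerivedField` definition.

```r
data(iris)
# Train on the derived feature, mirroring the model above
train <- transform(iris, Sepal.Length.Sqrt = sqrt(Sepal.Length))
fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data = train)

# New input carries only the original field (values are illustrative);
# the derived field has to be recomputed before prediction.
new_input <- data.frame(Sepal.Length = c(5.0, 6.5))
new_input$Sepal.Length.Sqrt <- sqrt(new_input$Sepal.Length)
scores <- predict(fit, newdata = new_input)
print(scores)
```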

## Multiple input fields

It is also possible to create new features by combining several fields. Using the same `iris` dataset, let’s create a new field from the squared ratio of `Sepal.Length` to `Petal.Length`:

```r
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,
                         origFieldName = "Sepal.Length,Petal.Length",
                         newFieldName = "Squared.Length.Ratio",
                         formulaText = "(Sepal.Length / Petal.Length)^2")
```

As before, the new field is added as a column to the `irisBox$data` data frame:

```r
kable(head(irisBox$data, 3))
```

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Squared.Length.Ratio |
|---|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa | 13.27041 |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa | 12.25000 |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa | 13.07101 |
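The same numbers can be checked directly in base R. This is only an arithmetic cross-check of the formula text, independent of `pmmlTransformations`; the variable name `ratio_check` is ours.

```r
data(iris)
# (Sepal.Length / Petal.Length)^2 for the first three rows
ratio_check <- with(head(iris, 3), (Sepal.Length / Petal.Length)^2)
print(round(ratio_check, 5))
```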

Fit a linear model for `Petal.Width` using this new feature, and convert it to PMML:

```r
fit <- lm(Petal.Width ~ Squared.Length.Ratio, data = irisBox$data)
fit_pmml <- pmml(fit, transform = irisBox)
```

The PMML will contain `Sepal.Length` and `Petal.Length` in the `DataDictionary` and `MiningSchema` nodes, since these were used in `FunctionXform`:

```r
fit_pmml[[2]]       # DataDictionary node
fit_pmml[[3]][[1]]  # MiningSchema node
```

The `LocalTransformations` node contains `Squared.Length.Ratio` as a derived field:

```r
fit_pmml[[3]][[3]]  # LocalTransformations node
```

## PMML for arbitrary functions

The function `functionToPMML` (part of the `pmml` package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values. This can be useful for debugging.

As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Constants in the expression passed to `FunctionXform` are always assumed to be of type `double`. Variables in the expression are always assumed to be field names, and not substituted. That is, even if `x` has a value in the R environment, the resulting expression will still use `x`.

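```r
functionToPMML("1 + 2")
#> <Apply function="+">
#>  <Constant dataType="double">1</Constant>
#>  <Constant dataType="double">2</Constant>
#> </Apply>

x <- 3
functionToPMML("foo(bar(x * y))")
#> <Apply function="foo">
#>  <Apply function="bar">
#>   <Apply function="*">
#>    <FieldRef field="x"/>
#>    <FieldRef field="y"/>
#>   </Apply>
#>  </Apply>
#> </Apply>
```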

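```r
functionToPMML("if(a<2) {x+3} else if (a>3) {'four'} else {5}")
#> <Apply function="if">
#>  <Apply function="lessThan">
#>   <FieldRef field="a"/>
#>   <Constant dataType="double">2</Constant>
#>  </Apply>
#>  <Apply function="+">
#>   <FieldRef field="x"/>
#>   <Constant dataType="double">3</Constant>
#>  </Apply>
#>  <Apply function="if">
#>   <Apply function="greaterThan">
#>    <FieldRef field="a"/>
#>    <Constant dataType="double">3</Constant>
#>   </Apply>
#>   <Constant dataType="string">four</Constant>
#>   <Constant dataType="double">5</Constant>
#>  </Apply>
#> </Apply>
```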
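The rules described above (valid syntax is required, function names need not exist, variables are not substituted) can be explored with base R's parser alone. The sketch below does not call `functionToPMML` and makes no claim about how the `pmml` package implements the conversion; `is_valid` is a hypothetical helper introduced for illustration.

```r
# A syntactically valid string parses even though foo() and bar() do not exist
expr <- parse(text = "foo(bar(x * y))")[[1]]

x <- 3  # a value for x in the environment does not change the parsed expression
field_names <- all.vars(expr)                            # symbols used as variables
function_names <- setdiff(all.names(expr), field_names)  # symbols used as functions
print(field_names)
print(function_names)

# Hypothetical helper: TRUE when the text parses as R code
is_valid <- function(text) {
  !inherits(try(parse(text = text), silent = TRUE), "try-error")
}
print(is_valid("foo(bar(x * y))"))  # balanced parentheses
print(is_valid("foo(bar(x * y)"))   # unbalanced parentheses
```

Because parsing never evaluates the expression, `x` stays a symbol rather than being replaced by `3`, which matches the field-name behavior described for `functionToPMML`.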

## Conclusion

`FunctionXform` makes it possible to easily create new features for PMML models with R.

The `pmmlTransformations` `FunctionXform` vignette contains additional examples, including transforming categorical data, using transformed features in another transform, unsupported functions, and notes on limitations of the function.