You fit a boosting tree model in R with your favourite package xgboost, and the validation results look great. The next question is how to deploy this model into production so that others can utilize it to help with the business. Incorporating the model into a shiny app would certainly be a good idea, but sometimes the model needs to be integrated into other systems that the company relies on heavily. Plus, moving large amounts of data between the database and R can be time- and memory-consuming. So we propose the R package xgb2sql, enabling in-database scoring of XGBoost models built in R by translating trained model objects into SQL queries.
CRAN Task View: Model Deployment with R categorizes the process of deploying models to various environments for scoring or inferencing on new data into two categories. The first is Deployment through Different Types of Artifacts, which essentially means exporting the model as an object, then using supported software/platforms to consume this object and score out the model predictions. The other is Deployment through Cloud/Server, which includes (a) providing an R interface to third-party managed services such as Google Cloud Machine Learning Engine, and (b) turning R code into a web API and opening the service on a server. Our approach provides a SQL query producing the model predictions, which can be taken as a combination of the model itself plus the scoring process. The output SQL query can be treated as an artifact, but we can also easily set up a service for it on the database server.
The SQL query generated by this tool is basic enough to be compatible with all SQL-based database products and services. Besides this tool, there are two R packages providing modeling and predicting capability inside the database: ibmdbR offers modeling and predicting capability for naive Bayes, linear regression, decision trees, association rules, and K-means clustering, but the only supported database product and service is IBM DB2; tidypredict leverages dplyr and dbplyr for SQL translation, supporting linear regression, generalized linear models, and random forests. The rest of this vignette first covers preparing the data with onehot2sql(), then the suggested work-flow, and finally translating the trained model with booster2sql().
As we know, xgboost only consumes numeric input for its model fitting function [1]. So after transferring the raw table in the database to R as a data.frame/data.table, the same one-hot encoding needs to be performed on both the table and the data.frame/data.table. Here we have the function onehot2sql() to perform one-hot encoding on the training data in R, producing at the same time the SQL query that performs the exact same transformation for the raw table in the database. Let's start by loading the sample dataset from ggplot2:
library(data.table)
library(xgboost)
library(xgb2sql)
df <- data.frame(ggplot2::diamonds)
head(df)
#> carat cut color clarity depth table price x y z
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
#> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Function onehot2sql() is built upon the base R functions model.frame() and model.matrix(). Besides consuming a data.frame as input, this function has been optimized to work with a data.table for greater efficiency. It outputs a matrix ready for model fitting, following the rules listed below:
- The function treats any non-numeric columns, i.e., columns whose class is neither numeric nor integer, as categorical and performs one-hot encoding on them.
- The one-hot encoding doesn't remove one feature-level combination for each categorical feature, as model.matrix() does in order to avoid issues caused by multicollinearity. Although the output matrix conveys the same amount of information with or without one feature-level combination removed, thus producing similar model performance, the system knowledge gained along the way is very different. Say the "cut" of a diamond being "Ideal" has a huge impact on its price as the target/response: removing the binary column "cut.Ideal" from the output matrix would scatter the predictive power of "cut.Ideal" among the other "cut" columns [2]. The model performance would be comparable, but by studying the variable importance we would miss the information that "cut" being "Ideal" is the dominant factor of price. So, as multicollinearity isn't a problem for tree-based models [3], we believe performing full one-hot encoding is more appropriate for XGBoost modeling (see the sketch after this list).
- The function preserves NAs inside both categorical and numeric features. As pointed out by the author of xgboost, the algorithm will automatically learn the best direction to go when a value is missing, which can be viewed as automatically "learning" the best imputation value for missing values based on the reduction in training loss [4]. This is one of the reasons XGBoost is so powerful, so we keep all NAs in the output matrix.
- The function outputs meta information tracking all the levels of each categorical feature. If this meta information is given to the function as an input, the exact feature-level combinations of the training data will be populated, even if the new data is missing one level of a particular categorical feature, or has a new level never seen before.
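To make the second rule concrete, here is a minimal base-R sketch (independent of the package) contrasting the default behaviour of model.matrix() with the full one-hot encoding performed by onehot2sql():

# by default, model.matrix() absorbs the first factor level into the intercept
f <- factor(c("Fair", "Good", "Ideal"))
model.matrix(~ f)      # columns: (Intercept), fGood, fIdeal -- "Fair" is dropped
# removing the intercept keeps one indicator column per level,
# i.e., the full encoding
model.matrix(~ f - 1)  # columns: fFair, fGood, fIdeal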
The available arguments of this function, which will be explained with examples, are:
onehot2sql(data, meta=NULL, sep="_", ws_replace=TRUE, ws_replace_with="",
           unique_id=NULL, output_file_name=NULL, input_table_name=NULL)
The output of this function is a list containing:
- meta: data tracking the transformation.
- model.matrix: the data after processing, ready for XGBoost fitting.
- sql: the query performing the exact one-hot encoding in the database.
So let's take a look at its basic usage:
out <- onehot2sql(df)
print(out$meta)
#> $num.vec
#> [1] "carat" "depth" "table" "price" "x" "y" "z"
#>
#> $catg.vec
#> [1] "cut" "color" "clarity"
#>
#> $contrasts
#> $contrasts$cut_
#> Fair Good Very Good Premium Ideal
#> Fair 1 0 0 0 0
#> Good 0 1 0 0 0
#> Very Good 0 0 1 0 0
#> Premium 0 0 0 1 0
#> Ideal 0 0 0 0 1
#>
#> $contrasts$color_
#> D E F G H I J
#> D 1 0 0 0 0 0 0
#> E 0 1 0 0 0 0 0
#> F 0 0 1 0 0 0 0
#> G 0 0 0 1 0 0 0
#> H 0 0 0 0 1 0 0
#> I 0 0 0 0 0 1 0
#> J 0 0 0 0 0 0 1
#>
#> $contrasts$clarity_
#> I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
#> I1 1 0 0 0 0 0 0 0
#> SI2 0 1 0 0 0 0 0 0
#> SI1 0 0 1 0 0 0 0 0
#> VS2 0 0 0 1 0 0 0 0
#> VS1 0 0 0 0 1 0 0 0
#> VVS2 0 0 0 0 0 1 0 0
#> VVS1 0 0 0 0 0 0 1 0
#> IF 0 0 0 0 0 0 0 1
head(out$model.matrix)
#> (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
#> 1 1 0.23 0 0 0 1 0
#> 2 1 0.21 0 0 1 0 0
#> 3 1 0.23 0 0 0 0 1
#> 4 1 0.29 0 0 0 0 0
#> 5 1 0.31 0 0 0 1 0
#> 6 1 0.24 0 0 0 0 0
#> clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
#> 1 0 0 0 0 1 0 0 0
#> 2 0 0 0 0 1 0 0 0
#> 3 0 0 0 0 1 0 0 0
#> 4 1 0 0 0 0 0 0 0
#> 5 0 0 0 0 0 0 0 0
#> 6 0 0 1 0 0 0 0 0
#> color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
#> 1 0 0 0 0 1 0 0 61.5
#> 2 0 0 0 0 0 1 0 59.8
#> 3 0 0 0 1 0 0 0 56.9
#> 4 1 0 0 0 0 1 0 62.4
#> 5 0 1 0 1 0 0 0 63.3
#> 6 0 1 0 0 0 0 1 62.8
#> price table x y z
#> 1 326 55 3.95 3.98 2.43
#> 2 326 61 3.89 3.84 2.31
#> 3 327 65 4.05 4.07 2.31
#> 4 334 58 4.20 4.23 2.63
#> 5 335 58 4.34 4.35 2.75
#> 6 336 57 3.94 3.96 2.48
It should be noted that the level "Very Good" of feature "cut" has been replaced with "VeryGood", with the white-space removed. This behaviour is controlled by the function arguments ws_replace=TRUE and ws_replace_with="", where another symbol can be specified to replace the white-space inside levels of categorical features. Such processing is necessary because SQL databases usually don't allow white-space inside table column names. The symbol separating a feature from its levels is controlled by sep="_". The output model.matrix has all of its columns reordered alphabetically.
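Both arguments can be customized. For instance, the following sketch uses a period as the separator and an underscore for white-space; the resulting column name is inferred from the rules above rather than verified output:

# hypothetical alternative naming scheme
out_alt <- onehot2sql(df, sep=".", ws_replace=TRUE, ws_replace_with="_")
# level "Very Good" of "cut" should now become column "cut.Very_Good"
grep("^cut", colnames(out_alt$model.matrix), value=TRUE)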
The SQL query performing one-hot encoding for the raw table is:
cat(out$sql)
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z],
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair],
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good],
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood],
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium],
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal],
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D],
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E],
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F],
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G],
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H],
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I],
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF]
#> FROM INPUT_TABLE
We want to emphasise here that a unique row identifier inside the raw table is crucial for in-database scoring of the XGBoost model. The column name of the identifier can be specified by the function argument unique_id, and it will be carried over to the table after one-hot encoding. If it is not given, the SQL query will be populated with the column name "ROW_KEY" for the identifier. Similarly, "INPUT_TABLE" is used in the query if the name of the raw table, input_table_name, is NULL. Given a valid value, the SQL query will be written to the file specified by output_file_name.
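As a sketch of these naming arguments (the id column and table name below are hypothetical):

# 'DIAMOND_ID' is assumed to be the unique row identifier of the raw
# table in the database, here named 'DB.DIAMONDS'
out_named <- onehot2sql(df, unique_id="DIAMOND_ID",
                        input_table_name="DB.DIAMONDS",
                        output_file_name="onehot.sql")
# the query in onehot.sql should now select DIAMOND_ID and read FROM DB.DIAMONDS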
Let’s have another example with NAs and a date column:
d2 <- data.table(ggplot2::diamonds)
# change column class
d2[, cut:=factor(cut, ordered=FALSE)]
d2[, clarity:=as.character(clarity)]
# create IDate column
d2[, tsdt:=as.IDate('2017-01-05')]
d2[1:3, tsdt:=tsdt-1]
# add NAs
d2[1, clarity:=NA]
d2[2, depth:=NA]
head(d2)
#> carat cut color clarity depth table price x y z tsdt
#> <num> <fctr> <ord> <char> <num> <num> <int> <num> <num> <num> <IDat>
#> 1: 0.23 Ideal E <NA> 61.5 55 326 3.95 3.98 2.43 2017-01-04
#> 2: 0.21 Premium E SI1 NA 61 326 3.89 3.84 2.31 2017-01-04
#> 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 2017-01-04
#> 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63 2017-01-05
#> 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 2017-01-05
#> 6: 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 2017-01-05
out2 <- onehot2sql(d2)
head(out2$model.matrix)
#> (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
#> 1 1 0.23 NA NA NA NA NA
#> 2 1 0.21 0 0 1 0 0
#> 3 1 0.23 0 0 0 0 1
#> 4 1 0.29 0 0 0 0 0
#> 5 1 0.31 0 0 0 1 0
#> 6 1 0.24 0 0 0 0 0
#> clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
#> 1 NA NA NA 0 1 0 0 0
#> 2 0 0 0 0 1 0 0 0
#> 3 0 0 0 0 1 0 0 0
#> 4 1 0 0 0 0 0 0 0
#> 5 0 0 0 0 0 0 0 0
#> 6 0 0 1 0 0 0 0 0
#> color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
#> 1 0 0 0 0 1 0 0 61.5
#> 2 0 0 0 0 0 1 0 NA
#> 3 0 0 0 1 0 0 0 56.9
#> 4 1 0 0 0 0 1 0 62.4
#> 5 0 1 0 1 0 0 0 63.3
#> 6 0 1 0 0 0 0 1 62.8
#> price table tsdt_20170104 tsdt_20170105 x y z
#> 1 326 55 1 0 3.95 3.98 2.43
#> 2 326 61 1 0 3.89 3.84 2.31
#> 3 327 65 1 0 4.05 4.07 2.31
#> 4 334 58 0 1 4.20 4.23 2.63
#> 5 335 58 0 1 4.34 4.35 2.75
#> 6 336 57 0 1 3.94 3.96 2.48
Now let's look at whether, when meta is given and the new data contains new elements, onehot2sql() outputs a model.matrix with columns identical to those of the training data, so that predict() can be applied to the trained model on the new data:
newdata <- d2[1:5,]
# newdata has columns with new elements
newdata[5, clarity:='NEW']; newdata[1,tsdt:=as.IDate('2018-05-01')]
# newdata has a new column
newdata[, new_col:=1]
# newdata is lacking a column
newdata[, cut:=NULL]
head(newdata)
#> carat color clarity depth table price x y z tsdt new_col
#> <num> <ord> <char> <num> <num> <int> <num> <num> <num> <IDat> <num>
#> 1: 0.23 E <NA> 61.5 55 326 3.95 3.98 2.43 2018-05-01 1
#> 2: 0.21 E SI1 NA 61 326 3.89 3.84 2.31 2017-01-04 1
#> 3: 0.23 E VS1 56.9 65 327 4.05 4.07 2.31 2017-01-04 1
#> 4: 0.29 I VS2 62.4 58 334 4.20 4.23 2.63 2017-01-05 1
#> 5: 0.31 J NEW 63.3 58 335 4.34 4.35 2.75 2017-01-05 1
onehot2sql(newdata, meta=out2$meta)$model.matrix
#> Warning in onehot2sql(newdata, meta = out2$meta): Following columns are populated with NAs:
#> cut
#> (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
#> 1 1 0.23 NA NA NA NA NA
#> 2 1 0.21 0 0 1 0 0
#> 3 1 0.23 0 0 0 0 1
#> 4 1 0.29 0 0 0 0 0
#> 5 1 0.31 0 0 0 0 0
#> clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
#> 1 NA NA NA 0 1 0 0 0
#> 2 0 0 0 0 1 0 0 0
#> 3 0 0 0 0 1 0 0 0
#> 4 1 0 0 0 0 0 0 0
#> 5 0 0 0 0 0 0 0 0
#> color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
#> 1 0 0 NA NA NA NA NA 61.5
#> 2 0 0 NA NA NA NA NA NA
#> 3 0 0 NA NA NA NA NA 56.9
#> 4 1 0 NA NA NA NA NA 62.4
#> 5 0 1 NA NA NA NA NA 63.3
#> price table tsdt_20170104 tsdt_20170105 x y z
#> 1 326 55 0 0 3.95 3.98 2.43
#> 2 326 61 1 0 3.89 3.84 2.31
#> 3 327 65 1 0 4.05 4.07 2.31
#> 4 334 58 0 1 4.20 4.23 2.63
#> 5 335 58 0 1 4.34 4.35 2.75
We can see from this example that the dropped column "cut" is populated with NAs (with a warning), the unseen level "NEW" of "clarity" and the unseen date are encoded as all zeros across their indicator columns, and the extra column "new_col" is ignored, so the output model.matrix has exactly the same columns as the training data.
We recommend that any feature engineering and/or missing-value imputation be done before applying onehot2sql() to the training data in R; it should be the last step before kicking off the model fitting. The SQL query for feature engineering and/or missing-value imputation can then be placed as a sub-query inside the one-hot query. For example, replacing "INPUT_TABLE" inside out$sql with the following sub-query will do one-hot encoding together with missing-value imputation for feature "clarity":
(SELECT ROW_KEY, [cut], [color],
(case when [clarity] IS NULL then 'MISS' else [clarity] end) as [clarity],
[carat], [depth], [table], [price], [x], [y], [z]
FROM INPUT_TABLE) AS IMPUTED_TABLE
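One way to splice the sub-query in is a simple string substitution in R, sketched here using the out$sql from above:

# store the imputation sub-query shown above as a string
imputed <- "(SELECT ROW_KEY, [cut], [color],
  (case when [clarity] IS NULL then 'MISS' else [clarity] end) as [clarity],
  [carat], [depth], [table], [price], [x], [y], [z]
FROM INPUT_TABLE) AS IMPUTED_TABLE"
# replace the table reference in the one-hot query with the sub-query
onehot_imputed <- sub("INPUT_TABLE", imputed, out$sql, fixed=TRUE)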
Before taking a close look at the function booster2sql(), which translates the XGBoost model into a SQL query, we want to illustrate the suggested work-flow for the whole process of model fitting and scoring with package xgb2sql:
1. Transfer the raw table from the database into R as a data.frame/data.table, e.g., using dplyr and DBI.
2. Apply onehot2sql() to the data.frame/data.table, obtaining the model.matrix and storing the one-hot query.
3. Fit the XGBoost model to the model.matrix in R.
4. Apply booster2sql() to the final model, producing the XGBoost query for its in-database scoring.
Now let's move back to function booster2sql(). Its available arguments are:
booster2sql(xgbModel, print_progress=FALSE, unique_id=NULL,
output_file_name=NULL, input_table_name=NULL, input_onehot_query=NULL)
The model input xgbModel to this function should have class xgb.Booster, and print_progress=FALSE controls whether the translation progress is printed to the console. As before, unique_id and input_table_name should be given to generate the SQL query. It should be noted that a valid file path must be given as output_file_name for writing out the query, otherwise the function will not run.
Let's try to predict the "price" of the diamonds using the other features. So that the generated XGBoost query is small enough to demonstrate, we will train the model with max.depth=2 and nround=2.
x <- out$model.matrix[,colnames(out$model.matrix)!='price']
y <- out$model.matrix[,colnames(out$model.matrix)=='price']
bst <- xgboost(data = x,
label = y,
max.depth = 2,
eta = .3,
nround = 2,
objective = 'reg:linear')
#> [03:08:26] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
#> [1] train-rmse:4095.428009
#> [2] train-rmse:3074.223184
booster2sql(bst, output_file_name='xgb.txt')
#> query is written to file with row unique id named as ROW_KEY
#> query is written to file with input table named as MODREADY_TABLE
cat(readChar('xgb.txt', file.info('xgb.txt')$size))
#> SELECT ROW_KEY , 0.5 + SUM(ONETREE) AS XGB_PRED
#> FROM (
#> SELECT ROW_KEY ,
#> (CASE WHEN [carat] < 0.995000005 THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END)
#> WHEN [carat] >= 0.995000005 THEN
#> (CASE WHEN [y] < 7.19499969 THEN 1841.06018
#> WHEN [y] >= 7.19499969 THEN 3696.24292
#> WHEN [y] IS NULL THEN 1841.06018 END)
#> WHEN [carat] IS NULL THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END) END) AS ONETREE FROM MODREADY_TABLE
#> UNION ALL
#>
#> SELECT ROW_KEY ,
#> (CASE WHEN [y] < 6.69499969 THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END)
#> WHEN [y] >= 6.69499969 THEN
#> (CASE WHEN [y] < 7.65499973 THEN 1814.65881
#> WHEN [y] >= 7.65499973 THEN 3217.57129
#> WHEN [y] IS NULL THEN 1814.65881 END)
#> WHEN [y] IS NULL THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END) END) AS ONETREE FROM MODREADY_TABLE
#> ) AS TREES_TABLE GROUP BY ROW_KEY
We can see that each SELECT ... AS ONETREE section inside the XGBoost query is composed of nested CASE WHEN statements, providing scores along a tree structure, and each of these sections represents one round/iteration of the XGBoost model. The values of the splits and scores within the query come from the xgb.dump() of the model, without any rounding:
xgb.dump(bst)
#> [1] "booster[0]"
#> [2] "0:[f1<0.995000005] yes=1,no=2,missing=1"
#> [3] "1:[f25<5.53499985] yes=3,no=4,missing=3"
#> [4] "3:leaf=317.401001"
#> [5] "4:leaf=922.349731"
#> [6] "2:[f25<7.19499969] yes=5,no=6,missing=5"
#> [7] "5:leaf=1841.06018"
#> [8] "6:leaf=3696.24292"
#> [9] "booster[1]"
#> [10] "0:[f25<6.69499969] yes=1,no=2,missing=1"
#> [11] "1:[f1<0.824999988] yes=3,no=4,missing=3"
#> [12] "3:leaf=289.332123"
#> [13] "4:leaf=1056.4021"
#> [14] "2:[f25<7.65499973] yes=5,no=6,missing=5"
#> [15] "5:leaf=1814.65881"
#> [16] "6:leaf=3217.57129"
It should be noted that the model prediction calculated by adding up the scores provided by xgb.dump() differs from the one obtained by applying predict() to the model directly. It is a floating-point rounding difference and thus extremely insignificant. But since the XGBoost query is generated with the scores from xgb.dump(), this difference will also exist between the in-database scoring and R's predict() of the model.
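For instance, taking the first row of the data and the leaf scores from the dump above, we can check the size of this discrepancy (a sketch assuming bst and x from the earlier fitting):

# row 1: carat = 0.23 < 0.995 and y = 3.98 < 5.535 -> leaf 317.401001 (tree 0)
#        y = 3.98 < 6.695 and carat = 0.23 < 0.825 -> leaf 289.332123 (tree 1)
pred_sql <- 0.5 + 317.401001 + 289.332123    # what the SQL query computes
pred_r   <- predict(bst, x[1, , drop=FALSE]) # R's native scoring
pred_r - pred_sql                            # tiny floating-point difference, if any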
There is one last argument of booster2sql() we haven't talked about, namely input_onehot_query. Here we can pass in the one-hot query generated by onehot2sql(), which will be used as a sub-query replacing "MODREADY_TABLE" within the XGBoost query. In this way, the XGBoost query can be executed directly on the raw table, producing the model predictions in one step.
booster2sql(bst, output_file_name='onehot-xgb.txt', input_onehot_query=out$sql)
#> query is written to file with row unique id named as ROW_KEY
cat(readChar('onehot-xgb.txt', file.info('onehot-xgb.txt')$size))
#> SELECT ROW_KEY , 0.5 + SUM(ONETREE) AS XGB_PRED
#> FROM (
#> SELECT ROW_KEY ,
#> (CASE WHEN [carat] < 0.995000005 THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END)
#> WHEN [carat] >= 0.995000005 THEN
#> (CASE WHEN [y] < 7.19499969 THEN 1841.06018
#> WHEN [y] >= 7.19499969 THEN 3696.24292
#> WHEN [y] IS NULL THEN 1841.06018 END)
#> WHEN [carat] IS NULL THEN
#> (CASE WHEN [y] < 5.53499985 THEN 317.401001
#> WHEN [y] >= 5.53499985 THEN 922.349731
#> WHEN [y] IS NULL THEN 317.401001 END) END) AS ONETREE FROM (
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z],
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair],
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good],
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood],
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium],
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal],
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D],
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E],
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F],
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G],
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H],
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I],
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF]
#> FROM INPUT_TABLE
#> ) AS MODREADY_TABLE
#> UNION ALL
#>
#> SELECT ROW_KEY ,
#> (CASE WHEN [y] < 6.69499969 THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END)
#> WHEN [y] >= 6.69499969 THEN
#> (CASE WHEN [y] < 7.65499973 THEN 1814.65881
#> WHEN [y] >= 7.65499973 THEN 3217.57129
#> WHEN [y] IS NULL THEN 1814.65881 END)
#> WHEN [y] IS NULL THEN
#> (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#> WHEN [carat] >= 0.824999988 THEN 1056.4021
#> WHEN [carat] IS NULL THEN 289.332123 END) END) AS ONETREE FROM (
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z],
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair],
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good],
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood],
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium],
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal],
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D],
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E],
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F],
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G],
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H],
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I],
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1],
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF]
#> FROM INPUT_TABLE
#> ) AS MODREADY_TABLE
#> ) AS TREES_TABLE GROUP BY ROW_KEY
As the processing time and query size grow exponentially with max.depth and linearly with nround, this approach of combining the one-hot query and the XGBoost query together should be used only for simple models.
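A back-of-the-envelope count makes this concrete: assuming each boosting round produces a full binary tree, the query contains roughly nround * 2^max.depth leaf scores (plus the repeated splits guarding them):

# rough number of leaf-score terms in the generated query
leaf_terms <- function(nround, max_depth) nround * 2^max_depth
leaf_terms(2, 2)     # 8, matching the query above (2 trees x 4 leaves each)
leaf_terms(100, 6)   # 6400 for a more typically sized model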