--- title: "Deploy XGBoost Model as SQL Query" author: "Chengjun Hou, Abhishek Bishoyi" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Deploy XGBoost Model as SQL Query} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` You fit a boosting tree model in R with your favourite package `xgboost`, the validation results looks great, so next question comes up as how to deploy this model into production so that others could utilize it to help with the business. Incorporating the model into a shiny app would certainly be a good idea, but sometimes the model needs to be integrated into some other systems that the company is heavily relied on. Plus moving large amount of data between database and R could be time and memory consuming. So we propose R package `xgb2sql` enabling in-database scoring of XGBoost models built in R by translating trained model objects into SQL query. [CRAN Task View: Model Deployment with R](https://CRAN.R-project.org/view=ModelDeployment) categorizes the process of deploying models to various environments for scoring or inferencing on new data into two categories. The first category is **Deployment through Different Types of Artifacts**, which basically means exporting the model as an object, then using supported software/platform to consume this object scoring out the model predictions. The other category is **Deployment through Cloud/Server**, which includes a). providing an R interface to third-party managed services such as [Google Cloud Machine Learning Engine](https://cloud.google.com/ml-engine/); b). turning R code into web API and opening service on the server. Our approach provides SQL query producing model predictions, which can be taken as a combination of the model itself plus the scoring process. The output SQL query can be treated as an artifact, but we can easily set up service for it on the database server. The SQL query generated by this tool is basic enough to be compatible with all SQL-based database products and services. Other than this tool, there are two R packages providing modeling and predicting capability inside database: - Package `ibmdbR` offers modeling and predicting capability for naive-Bayes, linear regression, decision tree, association rules, and K-means clustering. But the only supported database product and service is IBM DB2. - Package `tidypredict` leverages `dplyr` and `dbplyr` for SQL translation, supporting linear regression, generalized linear model, and random forest. Here is the outline for the rest of this vignette: - [Prepare Data in Both R and Database](#data) - [Transform XGBoost Model into SQL Query](#xgb) ## Prepare Data in Both R and Database {#data} As we know, `xgboost` only consumes numeric input for its model fitting function [^1]. So after transferring raw table in database to R as a data.frame/data.table, same one-hot encoding needs to be performed on both the table and the data.frame/data.table. Here we have function `onehot2sql()` to perform one-hot encoding on the training data in R, producing at the same the SQL query performing the exact transformation for the raw table in database. 
Let's start with loading the sample `diamonds` dataset from `ggplot2`:

```{r, message=FALSE, warning=FALSE}
library(data.table)
library(xgboost)
library(xgb2sql)

df <- data.frame(ggplot2::diamonds)
head(df)
```

Function `onehot2sql()` is built upon the base R functions `model.frame()` and `model.matrix()`. Besides consuming a data.frame as input, this function has been optimized to work with a data.table for greater efficiency. It outputs a matrix ready for model fitting, following the rules listed below:

1. The function treats any non-numeric columns, i.e., columns with class other than `numeric` and `integer`, as categorical and performs one-hot encoding on them.
1. The one-hot encoding doesn't remove one feature-level combination for each categorical feature, as `model.matrix()` does in order to avoid issues caused by multicollinearity. Although the output matrix conveys the same amount of information with or without one feature-level combination removed, thus producing similar model performance, the system knowledge gained along this modeling practice is very different. Suppose the "cut" of a diamond being "Ideal" has a huge impact on its price as the target/response; removing the binary column "cut.Ideal" from the output matrix would scatter the predictive power of "cut.Ideal" among the other "cut" columns [^2]. The model performance would be comparable, but by studying the variable importance we would miss the information that "cut" being "Ideal" is the dominant factor of price. So since multicollinearity isn't a problem for tree-based models [^3], we believe performing full one-hot encoding is more appropriate for XGBoost modeling.
1. The function preserves NAs inside both categorical and numeric features. As pointed out by the author of `xgboost`, the algorithm will automatically learn what is the best direction to go when a value is missing, which can be viewed as automatically "learning" the best imputation value for missing values based on reduction in training loss [^4]. This is one of the reasons XGBoost is so powerful, so we keep all NAs in the output matrix.
1. The function outputs `meta` information tracking all the levels of each categorical feature. If it is given to the function as an input, the exact feature-level combinations will be populated, even if the new data is missing one level for a particular categorical feature, or has a new level never seen before.

The available arguments of this function, which will be explained with examples, are:

```
onehot2sql(data, meta=NULL, sep="_", ws_replace=TRUE, ws_replace_with="",
           unique_id=NULL, output_file_name=NULL, input_table_name=NULL)
```

The output of this function is a list containing:

1. `meta` data tracking the transformation.
1. matrix `model.matrix`, being the data after processing, which is ready for XGBoost fitting.
1. SQL query `sql` performing the exact one-hot encoding in the database.

So let's take a look at its basic usage:

```{r}
out <- onehot2sql(df)
print(out$meta)
head(out$model.matrix)
```

It should be noted that level "Very Good" for feature "cut" has been replaced with "VeryGood", with the white-space removed. This behaviour is controlled by the function arguments `ws_replace=TRUE` and `ws_replace_with=""`, where another symbol can be specified to replace the white-space inside levels of categorical features. Such processing is necessary as SQL databases usually don't allow white-space inside table column names. The symbol separating a feature and its levels is controlled by `sep="_"`.
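As a quick illustration of these two arguments, here is a sketch (not evaluated) that keeps a dot in place of the white-space and uses a double underscore as the separator, so that, assuming the arguments behave as described above, level "Very Good" of feature "cut" would become column "cut__Very.Good":

```
out_alt <- onehot2sql(df, sep="__", ws_replace=TRUE, ws_replace_with=".")
colnames(out_alt$model.matrix)
```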
The output model.matrix has all its columns reordered alphabetically. The SQL query performing one-hot encoding on the raw table is:

```{r}
cat(out$sql)
```

We want to emphasise here that **a unique row identifier** inside the raw table is crucial for in-database scoring of the XGBoost model. The column name of the identifier can be specified by the function argument `unique_id`, and it will be passed along to the table after one-hot encoding. If it is not given, the SQL query will be populated with the column name "ROW_KEY" for the identifier. Similarly, "INPUT_TABLE" is used in the query if the name of the raw table, `input_table_name`, is `NULL`. Given a valid value, the SQL query will be written to the file specified by `output_file_name`.

Let's have another example with NAs and a date column:

```{r}
d2 <- data.table(ggplot2::diamonds)
# change column class
d2[, cut:=factor(cut, ordered=FALSE)]
d2[, clarity:=as.character(clarity)]
# create IDate column
d2[, tsdt:=as.IDate('2017-01-05')]
d2[1:3, tsdt:=tsdt-1]
# add NAs
d2[1, clarity:=NA]
d2[2, depth:=NA]
head(d2)

out2 <- onehot2sql(d2)
head(out2$model.matrix)
```

Next, let's check that when `meta` is given for new data containing new elements, `onehot2sql()` outputs a model.matrix with columns identical to those of the training data, so that `predict()` can be applied to the trained model on the new data:

```{r}
newdata <- d2[1:5,]
# newdata has columns with new elements
newdata[5, clarity:='NEW']; newdata[1,tsdt:=as.IDate('2018-05-01')]
# newdata has a new column
newdata[, new_col:=1]
# newdata is lacking a column
newdata[, cut:=NULL]
head(newdata)

onehot2sql(newdata, meta=out2$meta)$model.matrix
```

We can see from this example that

1. any new level will have values of 0 across all the columns related to that feature.
1. any new feature will not be in the output model.matrix.
1. an entire feature will be imputed with NAs if it is missing from the new data, and a warning will be given.

We recommend that any feature engineering and/or missing-value imputation be done before applying function `onehot2sql()` to the training data in R. It should be the last step before kicking off the model fitting. The SQL query for feature engineering and/or missing imputation can then be placed as a sub-query inside the one-hot query. For example, replacing "INPUT_TABLE" inside `out$sql` with the following sub-query will do one-hot encoding together with missing imputation for feature "clarity":

```
(SELECT ROW_KEY, [cut], [color],
        (case when [clarity] IS NULL then 'MISS' else [clarity] end) as [clarity],
        [carat], [depth], [table], [price], [x], [y], [z]
 FROM INPUT_TABLE) AS IMPUTED_TABLE
```


## Transform XGBoost Model into SQL Query {#xgb}

Before taking a close look at the function `booster2sql()`, which translates an XGBoost model into a SQL query, we want to illustrate the suggested work-flow for the whole process of model fitting and scoring with package `xgb2sql`:

1. We start with transferring the raw table from the database to R as a data.frame/data.table. There are many packages supporting database connections; we recommend `dplyr` and `DBI`, as sketched at the beginning of the previous section.
1. After all feature engineering and missing imputation is done, apply function `onehot2sql()` to the data.frame/data.table, obtaining the model.matrix and storing the one-hot query.
1. Conduct all modeling practices until reaching a final model, then apply function `booster2sql()` to the final model, producing the XGBoost query for its in-database scoring.
1. Modeling in R is done, so let's move to in-database scoring:
    + Execute the one-hot query on the raw table, creating the model-ready table.
    + Execute the XGBoost query on the model-ready table, obtaining the model predictions.
    + Comparing the model predictions in R with the values given by the XGBoost query is always recommended (see the sketch at the end of this section).

Now let's move back to function `booster2sql()`. The available arguments are:

```
booster2sql(xgbModel, print_progress=FALSE, unique_id=NULL,
            output_file_name=NULL, input_table_name=NULL, input_onehot_query=NULL)
```

The model input `xgbModel` to this function should have class `xgb.Booster`, and `print_progress=FALSE` controls whether the translating progress should be printed to the console. Similarly, `unique_id` and `input_table_name` should be given to generate the SQL query. It should be noted that there must be a valid file path for `output_file_name` to write the query to; otherwise the function will not run.

Let's try to predict the "price" of diamonds using the other features. In order to demonstrate the generated XGBoost query, we will train the model with `max.depth=2` and `nround=2`.

```{r}
x <- out$model.matrix[,colnames(out$model.matrix)!='price']
y <- out$model.matrix[,colnames(out$model.matrix)=='price']
bst <- xgboost(data = x, label = y,
               max.depth = 2, eta = .3, nround = 2,
               objective = 'reg:linear')

booster2sql(bst, output_file_name='xgb.txt')
```

```{r, warning=FALSE, message=FALSE}
cat(readChar('xgb.txt', file.info('xgb.txt')$size))
```

We can see that each `SELECT ... AS ONETREE` section inside the XGBoost query is composed of nested case-when statements, providing scores along a tree structure, and each of these sections represents one round/iteration of the XGBoost model. The values for the splits and scores within the query come from the `xgb.dump()` of the model, without any rounding:

```{r}
xgb.dump(bst)
```

It should be noted that the model prediction calculated by adding up the scores provided by `xgb.dump()` differs from that obtained by applying `predict()` to the model directly. It is a rounding difference and thus extremely insignificant. But since the XGBoost query is generated with scores from `xgb.dump()`, this difference will still exist between the in-database scoring and R's `predict()` of the model.

There is one last argument of `booster2sql()` we haven't talked about, i.e., `input_onehot_query`. Here we can input the one-hot query generated by `onehot2sql()`, which will be used as a sub-query replacing "MODREADY_TABLE" within the XGBoost query. In this way, the XGBoost query can be executed on the raw table, producing the model predictions directly.

```{r}
booster2sql(bst, output_file_name='onehot-xgb.txt', input_onehot_query=out$sql)
```

```{r, warning=FALSE, message=FALSE}
cat(readChar('onehot-xgb.txt', file.info('onehot-xgb.txt')$size))
```

As processing time and query size grow exponentially with `max.depth` and linearly with `nround`, this approach of combining the one-hot query and the XGBoost query together should be used only for simple models.
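Finally, to carry out the comparison recommended in the work-flow above, here is a minimal sketch (not evaluated), assuming the results of the XGBoost query have been pulled back into R as a data.frame `sql_pred` with a prediction column `PRED` (both names are placeholders), with rows aligned to `x` by the unique row identifier:

```
# a sketch: compare in-database scores with R's predict()
r_pred <- predict(bst, x)
max(abs(r_pred - sql_pred$PRED))
# should be near zero, up to the xgb.dump() rounding difference noted above
```

Any discrepancy beyond that rounding difference would point to a mismatch between the one-hot encoding performed in R and in the database.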
[^1]: [Understand your dataset with XGBoost](https://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html)

[^2]: [Ensembles of tree-based models: why correlated features do not trip them and why NA matters](https://medium.com/data-design/ensembles-of-tree-based-models-why-correlated-features-do-not-trip-them-and-why-na-matters-7658f4752e1b)

[^3]: [StackExchange: Does XGBoost handle multicollinearity by itself?](https://datascience.stackexchange.com/questions/12554/does-xgboost-handle-multicollinearity-by-itself)

[^4]: [GitHub Issue: What are the ways of treating missing values in XGBoost?](https://github.com/dmlc/xgboost/issues/21)