XGBoost
Dataset
Labels
XGBoost classifiers show inconsistent support for complex class labels across major versions.
XGBoost versions 0.X through 1.X follow SkLearn conventions, allowing complex class labels as input to the fit(X, y) method and returning them from the predict(X) method.
However, XGBoost version 2.0 removed this internal label encoding logic, and newer versions have not restored it.
Use the LabelEncoder transformer for manual label encoding and decoding.
Be aware that it sorts string labels in ascending lexicographic order, which may conflict with binary classification's assumption of a strict "no-event followed by event" order.
For example, string labels ["0", "1"] sort correctly, but ["one", "zero"] do not.
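To see the default ordering in action (a minimal sketch; the comments show the values that LabelEncoder produces after its lexicographic sort):
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(["zero", "one"])
# Lexicographic order puts "one" before "zero"
print(le.classes_)  # ['one' 'zero']
# "one" is encoded as 0 and "zero" as 1, the reverse of the desired no-event/event order
print(le.transform(["zero", "one"]))  # [1 0]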
Customize the order by (re-)setting the classes_ attribute:
from sklearn.preprocessing import LabelEncoder
import numpy
le = LabelEncoder()
le.classes_ = numpy.asarray(["zero", "one"])
# Encode from strings to integers
# Avoid calling `fit` or `fit_transform` as it would override the custom order
y_xgb = le.transform(y).reshape((-1, ))
# Fit and predict integers
classifier.fit(X, y_xgb)
yt_xgb = classifier.predict(X)
# Decode from integers back to strings
yt = le.inverse_transform(yt_xgb)
Attach the label encoder to the classifier using the _le attribute:
classifier._le = le
Workflow
Ordinal encoding
Group columns by operational type into continuous and categorical subsets.
The continuous subset may be further grouped by data type into floating-point and integer subsets.
However, this refinement is unnecessary from the training perspective, because DMatrix munges all feature data into float32 values internally.
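One way to derive these subsets (a minimal sketch, assuming the training data is already loaded into a Pandas dataframe named df; the continuous_cols and categorical_cols variables are reused in the examples below):
# df is assumed to be the training dataframe (not shown here)
# Continuous columns have a numeric data type
continuous_cols = df.select_dtypes(include = ["number"]).columns.tolist()
# Categorical columns are everything else (strings, booleans, etc.)
categorical_cols = df.select_dtypes(exclude = ["number"]).columns.tolist()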
General recipe: pass continuous columns through unchanged, and encode categorical columns to integers.
If you desire categorical splits, then the list of suitable transformers is limited to OrdinalEncoder (direct mapping) and TargetEncoder (pre-processed mapping).
Most importantly, avoid the OneHotEncoder transformer, because it binarizes the data, making advanced partitioning operations infeasible.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
transformer = ColumnTransformer([
    ("continuous", "passthrough", continuous_cols),
    ("categorical", OrdinalEncoder(), categorical_cols)
])
This column transformer produces a homogeneous NumPy array.
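A quick way to verify this (a sketch, assuming the transformer from above and a feature matrix X):
Xt = transformer.fit_transform(X)
# Passed-through and ordinally-encoded values end up in a single NumPy array with one shared dtype
print(type(Xt))  # <class 'numpy.ndarray'>
print(Xt.dtype)  # typically float64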
Just like LightGBM estimators, XGBoost estimators in their default configuration assume that all columns are continuous columns, and partition them using continuous splits.
Override this behaviour using the feature_types parameter.
The appropriate value is a list of string enums, where q denotes continuous columns and c denotes categorical ones.
Generating and passing feature_types:
from xgboost import XGBClassifier
feature_types = ["q"] * len(continuous_cols) + ["c"] * len(categorical_cols)
classifier = XGBClassifier(feature_types = feature_types)
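Note that the order of feature_types must match the column order produced by the column transformer, which here emits continuous columns first and categorical columns second. Assembling and fitting both steps (a sketch, reusing transformer, classifier, X and the label-encoded y_xgb from above):
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("transformer", transformer),
    ("classifier", classifier)
])
pipeline.fit(X, y_xgb)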
Casting
The main difference from the ordinal encoding workflow is converting any user-provided data container to a Pandas dataframe, which then enables casting columns to maximally specific data types.
The first part is easy. Simply instruct the ColumnTransformer meta-transformer to change modes using the set_output API.
The second part is harder, because SkLearn lacks a casting transformer.
Suggested workarounds include writing a cast function and wrapping it into the FunctionTransformer transformer, or developing a custom cast transformer class from scratch.
Neither option is particularly appealing due to productionization challenges.
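For illustration only, the FunctionTransformer workaround might look like this (a sketch; the cast_to_float32 helper is made up for this example):
from sklearn.preprocessing import FunctionTransformer
import numpy
def cast_to_float32(X):
    return X.astype(numpy.float32)
# Works in-session, but the helper function must remain importable wherever the fitted pipeline is deployed
cast_transformer = FunctionTransformer(cast_to_float32)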
The sklearn2pmml package provides the sklearn2pmml.preprocessing.CastTransformer transformer that meets all the requirements.
General recipe: cast continuous columns to numpy.float32, and cast categorical columns to pandas.CategoricalDtype.
When dealing with complex data types or disjoint value spaces, apply the CastTransformer to one column at a time.
from sklearn.compose import ColumnTransformer
from sklearn2pmml.preprocessing import CastTransformer
import numpy
transformer = ColumnTransformer(
    # Cast continuous columns together
    [("continuous", CastTransformer(dtype = numpy.float32), continuous_cols)] +
    # Cast categorical columns one by one due to disjoint value spaces
    [(categorical_col, CastTransformer(dtype = "category"), [categorical_col]) for categorical_col in categorical_cols]
)
transformer.set_output(transform = "pandas")
XGBoost estimators can now collect all feature descriptions from Pandas' dataframes automatically, making the feature_types parameter obsolete.
Instead, you must confirm your desire for categorical splits by passing enable_categorical = True:
from xgboost import XGBClassifier
classifier = XGBClassifier(enable_categorical = True)
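To illustrate the end-to-end effect (a sketch, reusing transformer, classifier, X and y_xgb from above):
# The transformed dataframe carries float32 and category dtypes, which XGBoost picks up automatically
Xt = transformer.fit_transform(X)
print(Xt.dtypes)
classifier.fit(Xt, y_xgb)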