LightGBM

Workflow

Ordinal encoding

General recipe: pass continuous columns through unchanged, and encode categorical columns to integers.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# continuous_cols and categorical_cols are lists of column names from the training dataset
transformer = ColumnTransformer([
    ("continuous", "passthrough", continuous_cols),
    ("categorical", OrdinalEncoder(), categorical_cols)
])

This column transformer produces a homogeneous NumPy array.
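A quick check of the transformed output (a sketch, assuming X is a pandas DataFrame that contains the referenced columns):

Xt = transformer.fit_transform(X)

# a single homogeneous numeric data type for all columns
print(type(Xt), Xt.dtype)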

Just like XGBoost estimators, LightGBM estimators in their default configuration assume that all columns are continuous, and partition them using continuous splits.

Override this behaviour by passing the categorical_feature parameter. The appropriate value is a list of categorical column references, either as integer column indices or as string column names.

The categorical_feature parameter is a fit(X, y) method parameter, not a constructor parameter. When the LightGBM estimator is embedded in a pipeline, pass the categorical_feature parameter (together with any helper parameters) via the Pipeline.fit(X, y, **params) method using the "parameter path syntax" (<step_name>__<parameter_name> = <value>).

Generating and passing categorical_feature as integer column indices:

from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline

# categorical columns occupy the positions right after the continuous block in the transformed array
categorical_feature = list(range(len(continuous_cols), len(continuous_cols) + len(categorical_cols)))

classifier = LGBMClassifier(random_state = 42)

pipeline = Pipeline([
    ("transformer", transformer),
    ("classifier", classifier)
])
pipeline.fit(X, y, classifier__categorical_feature = categorical_feature)

Generating and passing categorical_feature as string column names:

from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline

# transformed column order: the continuous block first, then the categorical block
feature_name = continuous_cols + categorical_cols
categorical_feature = categorical_cols

classifier = LGBMClassifier(random_state = 42)

pipeline = Pipeline([
    ("transformer", transformer),
    ("classifier", classifier)
])
pipeline.fit(X, y, classifier__feature_name = feature_name, classifier__categorical_feature = categorical_feature)

Passing categorical_feature as string column names is more robust, because the feature_name helper parameter guarantees proper alignment between column names and the columns of the NumPy array.

Column names can be real or synthetic. They appear in LightGBM model files, but not in the PMML representation of SkLearn LightGBM pipelines, as the PMML converter independently collects and maintains feature descriptions.

Casting

General recipe: pass continuous columns through unchanged, and cast categorical columns to pandas.CategoricalDtype.

SkLearn lacks a dedicated casting transformer. Some SkLearn transformers such as OrdinalEncoder do have a dtype parameter, but it is meant for optimizing the numeric data type of output NumPy arrays, not for general data type casting.
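
For example, the dtype parameter can shrink the output array's element type, but the result is still a plain NumPy array:

import numpy as np

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(dtype = np.int8)

Xt = encoder.fit_transform([["a"], ["b"], ["a"]])
# a leaner numeric data type, but still a plain NumPy array
print(Xt.dtype)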

The sklearn2pmml package provides the sklearn2pmml.decoration.CategoricalDomain decorator and the sklearn2pmml.preprocessing.CastTransformer transformer for this purpose. Select the one that better suits the pipeline configuration and objectives.

The CategoricalDomain decorator must be positioned first in the pipeline. It handles a wide variety of data ingestion tasks, including casting user-provided data to the desired data type, performing value validity checks, treating missing and invalid values, collecting metadata, etc.

In contrast, the CastTransformer transformer can appear anywhere in the pipeline. It specializes in data type conversions only.
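
For illustration, a sketch of a CategoricalDomain-based column transformer, assuming that the constructor accepts a dtype parameter for requesting the cast:

from sklearn.compose import ColumnTransformer
from sklearn2pmml.decoration import CategoricalDomain

# a sketch; the dtype = "category" argument is assumed to request
# a cast to pandas.CategoricalDtype during data ingestion
transformer = ColumnTransformer(
    [("continuous", "passthrough", continuous_cols)] +
    [(categorical_col, CategoricalDomain(dtype = "category"), [categorical_col]) for categorical_col in categorical_cols]
)
transformer.set_output(transform = "pandas")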

Casting to CategoricalDtype differs from ordinary casts (eg. from int to float, or vice versa), because it requires backing from the underlying data container. Specifically, Pandas' extension data types are not interoperable with NumPy arrays or matrices. For the cast to have the desired effect, collect the results into a Pandas' series or dataframe, and keep them in that form until the final estimator step of the pipeline.
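
The point is easy to demonstrate: converting a categorical series to a NumPy array silently degrades its data type:

import pandas as pd

s = pd.Series(["a", "b", "a"]).astype("category")

# the categorical data type survives in the Pandas data container...
print(s.dtype)
# ... but degrades to the object data type in the NumPy data container
print(s.to_numpy().dtype)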

The ML framework typically takes responsibility for data container management.

Most SkLearn transformers support both NumPy and Pandas data containers. Interconversions can happen at the meta-estimator level, when collecting and aggregating child transformer outputs. The default behaviour is downcasting Pandas' data containers to NumPy arrays, and downcasting Pandas' extension data types to the NumPy object data type. Override this using the set_output API.

Calling set_output(transform = "pandas") on either the child column transformer or the parent pipeline "fixes" the current workflow.

from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn2pmml.preprocessing import CastTransformer

transformer = ColumnTransformer(
    [("continuous", "passthrough", continuous_cols)] +
    # cast each categorical column to pandas.CategoricalDtype
    [(categorical_col, CastTransformer("category"), [categorical_col]) for categorical_col in categorical_cols]
)
# keep the results in a Pandas data container throughout the pipeline
transformer.set_output(transform = "pandas")

classifier = LGBMClassifier(random_state = 42)

pipeline = Pipeline([
    ("transformer", transformer),
    ("classifier", classifier)
])
pipeline.fit(X, y)

The LightGBM estimator can now detect categorical columns by their CategoricalDtype data type. There is no longer any need to pass fit params.
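
In fact, when the dataset is already a pandas DataFrame, the same effect can be achieved without any pipeline at all. A minimal sketch, assuming X is a pandas DataFrame:

from lightgbm import LGBMClassifier

X = X.copy()
# cast categorical columns to pandas.CategoricalDtype;
# the LightGBM estimator auto-detects them during fitting
X[categorical_cols] = X[categorical_cols].astype("category")

classifier = LGBMClassifier(random_state = 42)
classifier.fit(X, y)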

The lightgbm package has been relying on the pandas package for categorical data support since its earliest versions.

LightGBM estimators are more robust than XGBoost estimators in this regard, because they encode the effective mapping between category levels and category indices into LightGBM model files as a pandas_categorical section (cf. an external feature map file). However, the primary index on this mapping may become unstable if the training dataset contained unused features.
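
A sketch for inspecting this section in the text representation of a fitted LGBMClassifier, assuming the classifier from the previous example:

model_str = classifier.booster_.model_to_string()

for line in model_str.splitlines():
    if line.startswith("pandas_categorical"):
        print(line)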

Resources