CatBoost for tabular data

CatBoost is an open-source gradient-boosted decision tree (GBDT) library. Its name comes from the terms “Category” and “Boosting”. It was released by Yandex (often described as the Russian Google) in 2017.

Key attributes of CatBoost:

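built-in ranking objective functions
native handling of categorical features, with no manual preprocessing such as one-hot encoding required
model analysis tools (for example, feature importance)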
very fast prediction time
- reportedly 30-60x faster, as documented by companies using it in production
- on GPUs, reportedly 50-100x faster than XGBoost
performs remarkably well with default parameters, with further gains when tuned
uses ideas such as Ordered Target Statistics, borrowed from online learning: CatBoost treats the dataset as if it were sequential in time, using random permutations of the rows
- By introducing this notion of artificial time 🕰️, each row is encoded using only the rows that come "before" it. This reduces the Prediction Shift inherent in traditional gradient boosting implementations such as XGBoost and LightGBM.
reportedly up to 8x faster inference than XGBoost
- builds symmetric (oblivious) trees 🌲, which act as a form of regularisation and improve speed, especially during inference
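
The Ordered Target Statistics idea mentioned above can be illustrated with a small pure-Python sketch. This is not CatBoost's actual implementation; it only shows the core trick under simplifying assumptions: a single random permutation, a single categorical column, and a smoothed mean of the target. The function name `ordered_target_statistics` and the parameters `prior`, `prior_weight` and `seed` are hypothetical names chosen for this example.

```python
import random


def ordered_target_statistics(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode a categorical column using only targets seen 'earlier' in a random order.

    Illustrative sketch only: one permutation, one column, smoothed mean encoding.
    """
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # the "artificial time" ordering of the rows

    sums = {}    # running sum of targets per category, over rows already visited
    counts = {}  # running number of rows per category, over rows already visited
    encoded = [0.0] * n

    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # The row's own target is NOT used here, only targets of earlier rows.
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only after encoding do we add this row to the running statistics.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1

    return encoded, order


if __name__ == "__main__":
    cats = ["a", "b", "a", "a", "b"]
    ys = [1, 0, 1, 0, 1]
    enc, order = ordered_target_statistics(cats, ys, prior=sum(ys) / len(ys))
    for i in order:
        print(i, cats[i], ys[i], round(enc[i], 3))
```

Because a row's encoding never depends on its own label, the encoded feature does not leak the target, which is the source of prediction shift in a naive target encoding computed over the whole dataset. The first row of each category in the permutation simply receives the prior.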

References

The Gradient Boosters V: CatBoost – Deep & Shallow
XGBoost? CatBoost? LightGBM? | Plank
When to Choose CatBoost Over XGBoost or LightGBM [Practical Guide]
Is CatBoost faster than LightGBM and XGBoost?
ICR - Identifying Age-Related Conditions | Kaggle
Tabular Data: Deep Learning is Not All You Need
When Do Neural Nets Outperform Boosted Trees on Tabular Data?
- TabPFN

Resources

CatBoost

CatBoost: unbiased boosting with categorical features
CatBoost: A Deeper Dive | Kaggle
catboost_simple.py · optuna/optuna-examples
CatBoost - open-source gradient boosting library
CatBoost GitHub Repo

GBDT

Stochastic Gradient Boosting
Gradient Boost Part 1 (of 4): Regression Main Ideas
Ensembles: Gradient boosting, random forests, bagging, voting

BENEDICT NEO 梁耀恩
