deep & cross networks (DCN)

March 31, 2025


i woke up too late again. my alarm went off at 7:30, but i woke up at 9 am. this always happens. it's been happening for a week. i always fail to wake up early enough.

a few questions i couldn't answer when A mock-interviewed me:

  • how does logistic regression work?
  • what is MLE?
  • type i vs type ii error
  • can values of a pdf be > 1?
  • what does independence between two variables mean?
  • does independence mean covariance is 0?

studied DCN today. let's start from the basics. we have features for our models to make predictions, e.g. ad click prediction. but these individual features don't act independently; their combinations can have a unique effect: interactions.

ex: the combination of user_city = 'SF' AND ad_category = 'AI' AND time_of_day = 'morning' predicts a click much more strongly than any of them individually

how to create new features? feature crossing, e.g. city_X_ad_category

why? this helps us explicitly capture interactions and add non-linearity

you can cross catXcat, numXnum, catXnum, or even do higher-order crosses (three or more features)
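here's a minimal pandas sketch of manual crossing (all column names and the bucket count are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "user_city":   ["SF", "SF", "NYC"],
    "ad_category": ["AI", "cars", "AI"],
    "hour":        [9, 22, 8],
})

# cat X cat: concatenate the two category values into one new categorical feature
df["city_X_ad_category"] = df["user_city"] + "_" + df["ad_category"]

# cat X num: e.g. a city-gated copy of a numeric feature
df["hour_if_SF"] = df["hour"] * (df["user_city"] == "SF")

# the crossed vocabulary explodes combinatorially, so in practice it's hashed
# into a fixed number of buckets before being embedded
df["city_X_ad_bucket"] = df["city_X_ad_category"].map(lambda s: hash(s) % 1000)

print(df)
```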

we know deep neural networks learn these interactions implicitly: multiple layers with ReLU can approximate complex functions. however, they can be inefficient at learning specific, simple multiplicative crosses, requiring many neurons or layers to do so.

introducing: deep & cross network. the cross network is part of DCN, specifically designed to create these feature crosses explicitly.

the formula x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l (⊙ is the elementwise product) is mathematically structured to compute interactions between the original input features (x_0) and the representations learned so far (x_l).

an L-layer cross network explicitly models all feature interactions up to degree L + 1

this is the magic of DCN. instead of manually creating all possible crosses up to a certain order, the cross network learns which combinations are important. and you can choose L, which controls the max degree of interactions.
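here's a minimal pytorch sketch of the cross network built from that formula (dims, names, and the toy input are my own; in the full DCN this runs alongside or stacked with a deep MLP tower, and the outputs are combined before the final logit):

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """one cross layer: x_{l+1} = x_0 * (W_l x_l + b_l) + x_l (elementwise *)"""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)  # holds W_l and b_l

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # the residual (+ xl) keeps all the lower-order interactions around
        return x0 * self.linear(xl) + xl

class CrossNetwork(nn.Module):
    """L stacked cross layers: models interactions up to degree L + 1"""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([CrossLayer(dim) for _ in range(num_layers)])

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        xl = x0
        for layer in self.layers:
            xl = layer(x0, xl)  # every layer sees the original input x_0
        return xl

x0 = torch.randn(32, 16)                  # batch of 32 concatenated embeddings
cross = CrossNetwork(dim=16, num_layers=2)
print(cross(x0).shape)                    # torch.Size([32, 16])
```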

and DCNs have been evolving:

  • from Wide&Deep models: the wide part relies on manually engineered cross features (exhaustively enumerating all possible crosses doesn't scale)
  • DCN replaces manual engineering of cross features
    • two free parameters, the weight vector w and the bias vector b.
    • performance plateaus after 2 cross layers
  • DCN-v2 replaces the crossing vector w with a crossing matrix W
    • models bit-wise interactions between certain positions ("bits") of the feature vector
    • you can stack 4 or more layers and still see improvements
  • DCN-mix replaces the crossing matrix W with a mixture of low-rank experts + a gating network (see the simplified sketch after this list)
  • Residual DCN (LinkedIn's LiRank): builds on DCN-v2 by adding attention mechanisms within the (low-rank) cross network and specific skip connections, further enhancing interaction modeling and training stability for their large-scale systems
    • also, instead of taking high-dimensional one-hot encoded sparse features as input, it uses lower-dimensional embedding lookups, reducing the input dimension and parameter count
  • Gated Cross Network adds an information gate on top of each cross layer in DCN-v2, which controls the weight the model assigns to each feature interaction, preventing overfitting to noisy feature crosses
  • DCN-v3:
    • ensures interpretability with two explicit sub-networks: Linear and Exponential cross networks
    • Self-Mask operation to filter noise and reduce the number of parameters in the Cross Network by half
    • a fusion layer with a multi-loss trade-off (Tri-BCE) to provide appropriate supervision signals
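to make the DCN-mix idea concrete, here's a simplified sketch (my own naming; the real version also puts nonlinear activations inside the low-rank bottleneck, which i drop for brevity):

```python
import torch
import torch.nn as nn

class LowRankMoECrossLayer(nn.Module):
    """simplified DCN-mix cross layer: each expert approximates the crossing
    matrix W with low-rank factors U V^T, and a softmax gate computed from
    x_l mixes the expert outputs."""
    def __init__(self, dim: int, rank: int, num_experts: int):
        super().__init__()
        self.U = nn.ParameterList([nn.Parameter(0.01 * torch.randn(dim, rank))
                                   for _ in range(num_experts)])
        self.V = nn.ParameterList([nn.Parameter(0.01 * torch.randn(dim, rank))
                                   for _ in range(num_experts)])
        self.bias = nn.Parameter(torch.zeros(dim))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.gate(xl), dim=-1)            # (B, E)
        experts = torch.stack(
            [x0 * (xl @ V @ U.t() + self.bias)                  # (B, dim) per expert
             for U, V in zip(self.U, self.V)], dim=1)           # (B, E, dim)
        return (gates.unsqueeze(-1) * experts).sum(dim=1) + xl  # gated mix + residual
```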

here's a github repo implementing them in pytorch

but when are they used in the multi-stage ranking system? they're often used in the L2 or final (reranking) stage.

the architecture is usually:

  1. Candidate generation (retrieval): simple and fast methods (embedding similarity, simple collaborative filtering) to select 100-1000 from millions or billions (focus on recall, i.e. not missing good candidates, and speed)
  2. Ranking (first-pass): moderately complex models (logistic regression, GBDT) to score candidates and select 50-100 (focus on precision and speed)
  3. Reranking (fine-ranking): the most powerful models (e.g. DCN); the focus is on precision
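a toy sketch of this funnel; every stage here is a stub, and the names and budgets are just placeholders mirroring the numbers above:

```python
import random

def retrieve(user, catalog, n=1000):
    # cheap candidate generation, e.g. ANN over embeddings (stubbed with random picks)
    return random.sample(catalog, min(n, len(catalog)))

def rank_l1(user, candidates, n=100):
    # first-pass scoring, e.g. logistic regression / GBDT (stubbed with random scores)
    return sorted(candidates, key=lambda _: random.random())[:n]

def rerank_dcn(user, shortlist, k=10):
    # final stage: the heavy model (e.g. a DCN) scores the small shortlist (stubbed)
    return shortlist[:k]

def recommend(user, catalog, k=10):
    candidates = retrieve(user, catalog)    # millions/billions -> ~1000, recall + speed
    shortlist = rank_l1(user, candidates)   # ~1000 -> ~100, precision + speed
    return rerank_dcn(user, shortlist, k)   # ~100 -> k, precision

print(recommend(user="u42", catalog=list(range(100_000))))
```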

some metrics important in recsys

precision: TP / (TP + FP): out of all predicted positives, how many were actually positive? (how many of our recommendations were actually relevant?)

recall: TP / (TP + FN): out of all actual positives, how many did we identify as positive? (of all truly relevant shows, how many did we recommend?)

worried about false positives? focus on precision (be as precise about your positive predictions as you can). worried about false negatives? focus on recall (catch as many positives as possible).

in recsys, we often focus on the top recommendations, since users typically only look at the first few suggestions

  • precision@k: look at top k recs, how many are actually relevant?
    • if 6 are relevant out of top 10, precision@10 = 0.6
  • recall@k: out of all relevant items, how many appear in your top k?
    • if there are 20 relevant movies and 6 of them are in your top 10, recall@10 = 6/20 = 0.3

so precision is about how accurate your top k recommendations are, and recall is how much coverage your recommendations have
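a quick sketch of both metrics (the example numbers match the movie example above):

```python
def precision_at_k(recommended, relevant, k):
    # fraction of the top-k recommendations that are relevant
    return sum(item in relevant for item in recommended[:k]) / k

def recall_at_k(recommended, relevant, k):
    # fraction of all relevant items that show up in the top k
    return sum(item in relevant for item in recommended[:k]) / len(relevant)

recommended = list(range(10))     # top-10 recs: ids 0..9
relevant = set(range(4, 24))      # 20 relevant ids: 4..23 (6 overlap the top 10)
print(precision_at_k(recommended, relevant, 10))  # 0.6
print(recall_at_k(recommended, relevant, 10))     # 0.3
```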

