Last semester I took a course on deep learning and data science. With this knowledge, I decided it was time for me to try my first Kaggle competition! I chose the Santander challenge because the timeline was brief and the task seemed relatively simple: binary classification. I did not expect to win any money, so I went into this just to learn by experimenting with different methods and reading kernels and discussions from the awesome Kaggle community. I placed in the top 7%, and if you are on the fence about doing your first Kaggle competition, my advice is to just do it! There’s a lot to gain and nothing to lose, except maybe some time… In this blog post, I will summarize some of the things I tried and my takeaways.
The description of the challenge: “In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.”
My approach for this challenge was to work on it a bit every day, but not spend a crazy amount of time on it. As many Kagglers have pointed out, there is a lot to learn from others. So the first thing I did was look up the solutions other people had written up for previous competitions. I found that, in general, people followed a process of:
- Exploratory data analysis, plus reading public kernels, discussion threads, and related papers
- Data pre-processing, sampling, augmentation, and feature engineering
- Experimenting with different models and custom architectures
- Hyper-parameter tuning using cross-validation
- Submitting results to Kaggle to probe the test data
- Posting public kernels and discussions to get feedback from others, the key point being not to release all of the “magic”
- Repeating the above steps (not necessarily in order) and combining findings with a team to improve the leaderboard score
I did not have the time to do all of this as extensively as I would have liked, but I tried to do a little bit of each step. In case you want to follow along, you can check out my GitHub repo: https://github.com/AstroBoy1/santander.
So let’s begin with the data. The data consists of 200k train and 200k test rows with 200 variables, or features. We need to predict one column, which is a binary value of 0 or 1. We have no information about what exactly we are predicting or what these features represent. This data set is relatively small, so I decided to explore it in Google Colab. One useful trick that some people might not know about is that we can mount a Google Drive directory with “drive.mount('/content/gdrive')”. Then we can easily access files such as the training and test sets.
First I checked whether there was any missing data, and there wasn’t. I also checked the data types, which are all float64. So there is no categorical data; all of it is numeric.
With this in mind, I decided to look at the correlations between the features. Sometimes we may not want to keep correlated features in the model. In linear regression, for example, correlated features make it hard to interpret what the coefficients mean. Whether removing them helps on Kaggle depends on the model and the data.
We can see that the features are essentially uncorrelated, because the highest pairwise correlations are only around 0.01. This is usually a good thing for model performance, because uncorrelated features are less redundant and each one can potentially contribute new information.
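For anyone who wants to reproduce this kind of check, here is a minimal sketch. It runs on synthetic, independent Gaussian columns rather than the competition files, so the array `X` and its size are assumptions made for illustration, not my actual notebook code.

```python
import numpy as np

# Synthetic stand-in for the feature matrix: 1,000 rows, 20 independent features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

# Pearson correlation between every pair of columns.
corr = np.corrcoef(X, rowvar=False)

# Drop the diagonal, since every feature correlates perfectly with itself.
off_diagonal = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
max_abs_corr = off_diagonal.max()
print(f"largest absolute pairwise correlation: {max_abs_corr:.3f}")
```

On real data the same three steps apply after loading the feature columns into an array.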
Next, I wanted to see the distribution of targets. This is because we may want to treat data that is unbalanced differently than data that is balanced. So let’s just look at the percentage of the data that is labeled as 1 and 0.
About 90% of the training data is labeled as 0, and the other 10% is labeled as 1. This means the data is imbalanced, so a metric such as accuracy may not be the best way to judge our results: guessing all 0 for the training data would get us 90% accuracy. This is why the competition uses the area under the Receiver Operating Characteristic curve (ROC-AUC) as its metric. It is important in any Kaggle competition, or real-life problem for that matter, to understand the metric being evaluated. The ROC curve shows how a model does under varying thresholds. Because many models output a probability, it is often up to the business to determine the cutoff for classifying something as a 0 or 1. If our model says it is class 0 with 55% probability, does that mean we classify it as class 0? In certain cases, such as cancer imaging, doctors may want to ensure they have few false negatives, that is, cases where they think a patient has no cancer but the patient actually does. That is a bad situation. A credit card company predicting whether someone will default on a loan, on the other hand, may want fewer false positives if it wants more customers. The ROC curve plots the true positive rate, also called the recall or sensitivity, on the y-axis and the false positive rate on the x-axis. Below is an example graph.
The true positive rate is TP / (TP + FN). The false positive rate is FP / (FP + TN). As we vary the threshold from 1 to 0, we get different true positive and false positive rates. At one end, with a threshold of 1, our model doesn’t classify anything as positive, so both rates are 0%. At the other end, with a threshold of 0, our model classifies everything as positive, leading to a 100% true positive rate and a 100% false positive rate.
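To make the threshold sweep concrete, here is a small sketch in plain Python. The four labels and scores are made up for illustration, and in practice I would reach for a library routine such as Sklearn's roc_auc_score rather than this hand-rolled version.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when every score >= threshold is predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(y_true, scores):
    """Area under the ROC curve, using the trapezoidal rule over all thresholds."""
    # Start above every score (nothing positive) and end at the lowest score
    # (everything positive), so the curve runs from (0, 0) to (1, 1).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = [rates_at_threshold(y_true, scores, t) for t in thresholds]
    area = 0.0
    for (tpr_a, fpr_a), (tpr_b, fpr_b) in zip(points, points[1:]):
        area += (fpr_b - fpr_a) * (tpr_a + tpr_b) / 2
    return area


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))          # ranking is mostly right
print(roc_auc(y_true, [0, 0, 0, 0]))    # constant prediction carries no ranking
```

A constant prediction gives a curve that goes straight from (0, 0) to (1, 1), which is why an all-zeros submission scores no better than random guessing on this metric.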
Knowing the metric is particularly useful for callbacks in neural network training, where we can monitor the validation ROC-AUC and adjust things such as the learning rate, or stop training early.
Next I wanted to look at the feature distributions, because this would tell us whether there might be a need for some feature scaling, normalization, or standardization.
We can look at the feature distributions in a few different ways:
- The distribution of the 200 features in the training data overall and in terms of class 0 and class 1.
- The distribution of the 200 features in the test data vs training data.
What we want to see is the distribution of a feature looking different between the classes. Features with high variance can also be better, because that means there is more “information” for our models to learn from. If we have experience or are lucky, we may also be able to spot right away anything interesting that pops out about the features.
Below is the distribution of the 200 features in the training data overall, regardless of class.
This graph is a bit cluttered, but we can tell that the values range from about -75 to 75. Also, some of the distributions look normal. Normally distributed features often come from quantities that add up many small independent effects; test scores, for example, are often normally distributed. Because the features aren’t linearly correlated and seem to follow a Gaussian distribution, a good first classifier to try is a Gaussian Naive Bayes classifier, which we will try in the modeling section. Let us now take a look at the features for each class separately. With some, such as variable 0, we can see slight differences.
With others, such as variable 3, the difference looks smaller.
These are differences that models can generally pick up on. Next, let us look at the feature distributions in the training data and compare them with the feature distributions in the test data.
Here we can see that variable 0 varies more, as its distribution shows differences between train and test.
Variable 3, meanwhile, almost matches between train and test, which means that understanding that feature well during training could help our models do well at test time too. It could also mean that the variable is useless: if the class distributions are different between train and test, a matching feature distribution would mean the feature looks the same regardless of class.
Next I look at the row-wise distributions. Interestingly enough, they also look like a bell curve.
Time Series Check
I had heard that the previous Santander competition contained time series data, so I decided to take a quick peek at the data to check. First I created a simple scatter plot, with the x-axis being the features and the y-axis being the values. We might expect time series data to have some type of cyclic nature, with ups and downs like the stock market.
If the features were snapshots of one variable at successive times, we would expect to see such a pattern. What we get instead suggests that time does not play a role. We can also check the auto-correlation.
Here we see that as we vary the lag up to 200, the correlation never exceeds 0.2. This is another reason the features do not seem to be a time series.
Also, for fun, I try out a simple LSTM on the dataset in the modeling section to come, to finish up the time series check.
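Here is a rough sketch of how such an auto-correlation check can be done, treating one row's 200 values as a sequence. The two sequences below are synthetic (pure noise and a sine wave with a period of 20), chosen only to show what "no structure" and "clear structure" look like; the helper name is mine.

```python
import numpy as np


def autocorrelation(series, lag):
    """Pearson correlation between a series and itself shifted by `lag` (lag >= 1)."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]


rng = np.random.default_rng(0)
noise = rng.normal(size=200)                        # no temporal structure
wave = np.sin(2 * np.pi * np.arange(200) / 20)      # repeats every 20 steps

noise_ac = [autocorrelation(noise, lag) for lag in range(1, 50)]
wave_ac = [autocorrelation(wave, lag) for lag in range(1, 50)]

print(f"noise: max |autocorrelation| = {max(abs(a) for a in noise_ac):.2f}")
print(f"wave:  autocorrelation at lag 20 = {wave_ac[19]:.2f}")
```

A real time series would tend to look more like the wave, with correlations near 1 at some lags, while the noise stays close to zero at every lag.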
At this point, I did not know what else to explore, so I decided it was time to make some features.
Data Pre-processing and Feature Engineering
Some easy features to create are the sum, mean, max, min, standard deviation, skew, kurtosis, and median. Skew is a measure of asymmetry, and kurtosis is a measure of how heavy the tails are, that is, how prone the distribution is to outliers. Other people came up with some brilliant ideas during the competition, such as using unique counts of variable values. One kernel with a nice explanation: https://www.kaggle.com/cdeotte/200-magical-models-santander-0-920
This ended up being part of the “magic.” Essentially, for each feature, count the number of times each value appears in that feature across both the train and the “real” test data set.
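As a rough illustration of the counting idea, here is a sketch for a single feature column. The tiny lists are made up, the function name is mine, and the step of separating the “real” test rows from the rest is not shown; I simply assume `test_col` already contains only those rows.

```python
from collections import Counter


def value_count_feature(train_col, test_col):
    """For one feature, count how often each value occurs across train and test."""
    counts = Counter(train_col)
    counts.update(test_col)
    train_counts = [counts[value] for value in train_col]
    test_counts = [counts[value] for value in test_col]
    return train_counts, test_counts


train_col = [1.5, 2.0, 1.5, 3.7]
test_col = [2.0, 9.1, 1.5]
train_counts, test_counts = value_count_feature(train_col, test_col)
print(train_counts)  # [3, 2, 3, 1]
print(test_counts)   # [2, 1, 3]
```

Repeating this for each of the 200 features gives 200 extra count columns that can be fed to a model next to the raw values.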
Here are some of our new features. I took inspiration from this kernel: https://www.kaggle.com/gpreda/santander-eda-and-prediction
There are various models we can try. Some I try only to demonstrate that they don’t do well and generally shouldn’t be used for this kind of problem. Others are very promising. The types I look at are unsupervised and supervised learning.
For unsupervised learning, I use:
- K-means Clustering
- Agglomerative hierarchical clustering (bottom up)
Unsupervised learning is sometimes good for exploring the data, and also when there is a lot of unlabeled data, but it is usually not as useful for Kaggle competitions, and I’ll show why. As far as I could tell, these are the only ones in Sklearn that are fast on large data sets and where the user can easily specify the number of clusters.
For supervised learning, I use:
- Support Vector Machines (SVM)
- Quadratic Discriminant Analysis (QDA)
- Naive Bayes (NB)
- Gaussian Naive Bayes (GNB)
- XGBoost (XGB)
- LightGBM (LGBM)
- Catboost (CB)
- Neural Networks
The tree-based gradient boosted methods XGB, LGBM, and Catboost are some of the most popular methods for tackling tabular supervised learning problems on Kaggle, since they get good performance quickly without requiring us to specify a particular architecture, as neural networks do. Neural networks are very flexible, being able to approximate almost any function, with the downside of having a large number of options for defining the architecture. Methods such as SVMs, QDA, NB, and GNB are simpler and can be used as a litmus test to determine the difficulty of the problem and whether or not we can make any quick assumptions about the data.
First we will fit some base models to the data. Then I will talk about some of the ways to improve model performance through hyper-parameter tuning, along with using cross validation so that what we learn in training carries over to testing.
With k-means and two clusters, we get 49.4% of the rows in group 0 and 50.6% in the other group, nowhere near the 90/10 split of the labels. This occurs because k-means tends to prefer even-sized clusters. We can also scale the data, which usually helps because k-means uses Euclidean distance, but in this case it’s not worth the time and effort as the results are already abysmal.
Agglomerative Hierarchical Clustering
Unfortunately, Google Colab runs out of RAM when trying Sklearn’s hierarchical clustering, so I decided not to pursue it. I could have run it on my VM, but I expected it to be a waste of time.
Before beginning supervised learning, it’s always a good idea to think about how we want to train our models. If we were to train on all of the data we have, our model could easily overfit without us noticing. So it’s best to train on only a portion of the data, leaving some of it for testing purposes. We can randomly choose 80% of the data to train on and use the other 20% to “validate” our learning results. We repeat this with different model architectures and/or parameters and can choose the one with the best results on our validation set. But we don’t want the luck of a single random split to decide whether a model looks good or bad. A more robust way is to repeat this several times, using k folds for training and validation. The below figure shows 5 fold cross validation.
We pick one model and train it 5 times, and have it generate predictions 5 times. At the end, our model will have predictions for the entire training dataset. This is one way to do it at least. With an unbalanced dataset like this Santander dataset, where 90% of the labels are 0 and 10% of the labels are 1, we want to do stratified sampling. So the ratio of the classes should be roughly 9:1 in our train and validation folds; otherwise, our model will probably not learn how to differentiate class 0 from 1 as well as we want it to.
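To show what stratification does, here is a small sketch that assigns rows to folds class by class. It is written from scratch for illustration on a made-up 900/100 label vector; in practice Sklearn's StratifiedKFold does this job, and the function name below is my own.

```python
import numpy as np


def stratified_folds(y, n_splits=5, seed=0):
    """Assign each row to a fold so every fold keeps roughly the overall class ratio."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        # Shuffle the rows of this class, then deal them out to the folds in turn.
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of


y = np.array([0] * 900 + [1] * 100)   # 90% class 0, 10% class 1
fold_of = stratified_folds(y, n_splits=5)

for fold in range(5):
    val_mask = fold_of == fold
    print(f"fold {fold}: {val_mask.sum()} rows, {y[val_mask].mean():.0%} positives")
```

Each validation fold ends up with the same 9:1 ratio as the full data, and the rows outside the fold form the matching training set.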
With 5 fold cross validation, we calculate the ROC-AUC 5 times. Then we can get a mean value and a standard deviation; treating the fold-to-fold errors as roughly normally distributed is a common assumption, although with only 5 folds it is a rough one. Say we do this for 10 models and want to choose the best one. Ideally, we want one with a high mean value and a low standard deviation; usually we would just choose the one with the highest mean value. It’s also possible to bootstrap our data and calculate the ROC-AUC of our models on each resample. Then we might see the ROC-AUC values to be normally distributed, and we could choose one value to represent the ability of our model, such as the expected ROC-AUC. I have not seen people do this though, as it would be time-consuming and might only provide a marginal boost in score.
When we do 5 fold cross validation and decide we like model 1 the best, what we actually end up with is 5 different versions of that model, each trained on different data. We could just use all of the data to train one model, but once again, that may cause over-fitting more easily. Another option is to ensemble these versions somehow, such as by averaging their predictions.
To wrap up cross validation, I want to mention the last thing that I tried was nested cross validation. Nested cross validation isn’t used too often in Kaggle competitions where time is limited because the amount of time it takes is often not worth the boost in performance in comparison to another method such as feature engineering. I thought it was worth learning about and trying though.
In regular cross-validation our results are optimistic because we have chosen the best hyper-parameters based on our validation data. Technically, we’d want another outer loop in which we test the models with their chosen hyper-parameters on more data that hasn’t been seen. But this is very time-consuming. Let’s say we do 5 folds on the outer loop and 5 folds on the inner loop: we will have to train a model 25 times for each set of hyper-parameters.
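The cost is easy to see by just writing out the loops. This sketch only counts fits; it does not train anything, and it leaves out any final refit of the winning hyper-parameters, which would add a few more.

```python
def count_nested_cv_fits(n_outer=5, n_inner=5, n_candidates=1):
    """Count the model fits needed for nested cross-validation."""
    fits = 0
    for outer_fold in range(n_outer):            # held out for the final estimate
        for candidate in range(n_candidates):    # each hyper-parameter set compared
            for inner_fold in range(n_inner):    # used to score that candidate
                fits += 1
    return fits


print(count_nested_cv_fits())                  # 25 fits for one hyper-parameter set
print(count_nested_cv_fits(n_candidates=10))   # 250 fits when comparing ten sets
```

With boosted trees that take hours per fit, it is clear why this is rarely done under a competition deadline.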
Hyperparameter Tuning Aside
I tried four hyperparameter tuning methods.
Manual: Manually choose the hyper-parameters. This is a good option in the beginning to get models up to speed, and looking at what they are doing can then help with feature engineering. Usually there is not enough time to keep doing this in the long run, though.
Grid: Going through a grid of hyperparameters is more automated than manual tuning, but the search usually does not prioritize the more promising regions of the grid, which may waste some time.
Random: Can be better than grid search, as we will hit a wider variety of hyperparameter values faster.
Bayesian: Use a probability model of the objective function to choose the next set of hyperparameters. It builds a surrogate model of the objective, such as a Gaussian process. I ended up using upper confidence bound for the selection process. I provided reasonable bounds for my hyperparameters by looking through manuals and examples and finding the minimum and maximum values people suggested and used.
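To compare the grid and random approaches side by side, here is a toy sketch. The scoring function is a hypothetical stand-in for a cross-validated ROC-AUC (it is just a formula that peaks at a learning rate of 0.1 and a depth of 3), and the grid values and bounds are made up; none of this is my actual tuning code.

```python
import itertools
import random


def toy_score(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated ROC-AUC."""
    return 0.9 - (learning_rate - 0.1) ** 2 - 0.01 * (max_depth - 3) ** 2


def grid_search(score_fn, grid):
    """Score every combination in the grid and return the best one."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
    return max(combos, key=lambda params: score_fn(**params))


def random_search(score_fn, bounds, n_trials, seed=0):
    """Sample each hyper-parameter uniformly within its (low, high) bounds."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(*bounds["learning_rate"]),
            "max_depth": rng.randint(*bounds["max_depth"]),
        }
        trials.append((score_fn(**params), params))
    return max(trials, key=lambda trial: trial[0])


best_grid = grid_search(
    toy_score, {"learning_rate": [0.01, 0.1, 0.5], "max_depth": [2, 3, 6]}
)
best_score, best_random = random_search(
    toy_score, {"learning_rate": (0.01, 0.5), "max_depth": (2, 6)}, n_trials=50
)
print(best_grid)
print(best_score, best_random)
```

Grid search can only ever return one of the values we typed in, while random search tries a different learning rate on every trial. A Bayesian search would replace the uniform sampling with a choice guided by the surrogate model.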
Let’s begin with the simpler supervised algorithms: SVM, QDA, NB, GNB
Then move to the boosted trees: XGB, LGBM, and CB
And try out some Neural Networks.
To cut to the chase, I will just show the results for SVM, QDA, and NB.
SVM got 0.6 on the public test set, QDA got 0.646, and Naive Bayes got 0.624. For fun, I also tried random guessing and all 0s which both got 0.5.
Now let’s move onto GNB.
I performed 10 fold stratified cross validation with default parameters for GNB using Sklearn. The above shows the ROC curves for all 10 folds (the same figure as at the beginning). It gets a mean ROC-AUC of 0.89 with very little standard deviation. This is incredible! Then, with a quick training on all of the data and a submission, the result comes in at 0.888. This further supports the Gaussian nature of the features, along with their independence. The main problem with GNB is the lack of hyper-parameters to tune to boost the performance.
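A setup along these lines can be sketched with Sklearn as follows. The data here is synthetic (independent Gaussian features with a small mean shift for the positive class, about 10% positives), so the printed score is not the competition score, and the sizes and shift are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the competition data.
rng = np.random.default_rng(0)
n_rows, n_features = 2000, 20
y = (rng.random(n_rows) < 0.1).astype(int)
X = rng.normal(size=(n_rows, n_features)) + 0.3 * y[:, None]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, val_idx in cv.split(X, y):
    model = GaussianNB()                      # default parameters
    model.fit(X[train_idx], y[train_idx])
    val_pred = model.predict_proba(X[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[val_idx], val_pred))

print(f"mean ROC-AUC: {np.mean(fold_aucs):.3f} +/- {np.std(fold_aucs):.3f}")
```

Note that the ROC-AUC is computed from the predicted probabilities, not from the hard 0/1 predictions.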
Next I tried out the beloved Kaggle trio of XGB, LGBM, and CB. Using some naive parameters, I get 0.72 for XGB on the public test set, 0.83 with LGBM, and 0.82 with CB.
Before diving deep into one method in particular, I also give a few different neural networks a try, with LSTMs getting about 0.68 and a vanilla one layer NN getting 0.65.
With that in mind, I decided to focus my time on XGB, LGBM, CB, GNB, and neural networks.
But first, let me show my neural network architectures.
LSTMs are good for time series data, so I tried one on this dataset, mostly because I had learned about them but never tried them out. I assumed variable 0 is the first time step and variable 199 is the last time step, and used a single LSTM layer with 64 units. The training was super slow, and the validation ROC-AUC was increasing from around 0.6 quite slowly. I also thought of shuffling the order of the features fed into the LSTM, but with limited time, I decided not to move forward with it.
For XGBoost, I start with some manual tuning for submission 1 and have some fun probing the test data.
Submission 2: I add in a parameter that scales the weights because of the class imbalance. Sometimes it helps out for XGBoost.
Submission 3: I up the number of rounds to 100 and get a score of 0.868.
Submission 4: I up the rounds to 120 and get 0.872.
Submission 5: I up the rounds to 200 and get 0.876.
Submission 6: I up the rounds again to 300, increase the max depth, and reduce the learning rate, and finally see over-fitting: 0.873.
Submission 7: I increase the rounds to 500 to make sure the model didn’t get stuck and get 0.87, so it is over-fitting more.
Submission 8: Having begun to over-fit, it’s time to apply some parameters that help with that. Instead of doing this manually, I do a Bayesian search. I set up a Google Cloud VM with a GPU and run it there, because it takes more than 12 hours on a Google Colab K80, so it’s not doable there. There are a lot of parameters, so I have it search over the more relevant ones:
- colsample_bylevel: Subsample ratio of columns for each level of the tree.
- colsample_bytree: Subsample ratio of columns when constructing each tree. Adds randomness.
- gamma: Controls model complexity and over-fitting. A split only occurs if it reduces the loss function by more than gamma.
- max_delta_step: Caps how much each leaf output can change in a step, which can help convergence when predicting probabilities on imbalanced classes. Positive values make the steps more conservative.
- max_depth: Controls model complexity and over-fitting. Larger depth can lead to over-fitting.
- min_child_weight: Controls model complexity and over-fitting. The minimum sum of instance weights needed in a child. Higher values can lead to under-fitting, lower values can lead to over-fitting.
- reg_alpha: L1 regularization on the weights, which encourages sparsity (a form of feature selection).
- reg_lambda: L2 regularization, which gives a stable solution but is less robust than L1 (good discussion on L1 vs. L2: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/)
- scale_pos_weight: Handles the imbalanced dataset by weighting the positive class.
- subsample: Subsample ratio of the training rows. Adds randomness.
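For scale_pos_weight in particular, a common starting point is the ratio of negative to positive examples. Here is a tiny sketch of that calculation on a made-up label list with a 90/10 split; the round counts are for illustration, and this is not necessarily the value my search ended up with.

```python
def starting_scale_pos_weight(labels):
    """Ratio of negatives to positives, a common first guess for scale_pos_weight."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


labels = [0] * 180_000 + [1] * 20_000   # illustrative 90/10 split
print(starting_scale_pos_weight(labels))  # 9.0
```

Since the metric here is ROC-AUC, which only depends on how the predictions are ranked, it makes sense to treat this value as one more thing to search over rather than a fixed rule.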
After several days of training: 0.8869. The table below shows the parameters and results.
I start off LGBM with Bayesian hyper parameter tuning:
I use these parameters and let it run for a while; the best public score I got was 0.897 after one day of training. It seemed like LGBM was able to get a good score faster than XGBoost.
Around this time, some people posted public kernels getting 0.899 using LGBM. That was better than what I got, so I used their parameters.
I use Catboost, the “latest and greatest.”
Here is an example of some params I got:
I also do a Bayesian search, and it gets to 0.897 when I stop the training after a few days.
At this point in time, it seemed like people were stuck at 0.899 and were going in various directions. I thought about incorporating feature engineering, but I still felt like I hadn’t finished trying dense neural networks, so it was on to that. I was hoping they would be able to capture more complex relationships.
Interestingly, I did not see many other people trying out Catboost, maybe because, given its name, Catboost is more advantageous with categorical features, and because it’s newer, so there are fewer tutorials for it.
It seemed like these gradient boosted methods had reached their peak. And since people weren’t really trying neural networks, I thought I would try them.
My first dense network was set up as follows:
- First hidden layer with 128 nodes, relu activation, dropout = 0.5
- Second hidden layer with 64 nodes, relu activation, dropout = 0.5
- Third hidden layer with 16 nodes
- Sigmoid final dense layer
- optimizer='sgd' with Nesterov momentum, loss='binary_crossentropy', lr=0.01, reduced over the iterations
- num_epochs = 1000, with early stopping on the validation AUC metric
- Standard scaled data
Dropout is there to avoid overfitting, and we want to reduce the learning rate as we get closer to the optimum. Interestingly enough, this can only get to 0.7-0.8, even with various changes. Intuitively, this is because the features aren’t really interacting with each other. But when changing the input dimensions, it works a lot better. One person posted a kernel on this during the competition, and naturally I and everyone else took a look at what was the top scoring public NN at the time: https://www.kaggle.com/jotel1/nn-input-shape-why-it-matters
The one thing that boosted the score a lot is Input(shape=(200, 1)). A good question is: why does the shape matter? With this shape, the next dense layer is applied to each of the 200 features separately using the same weights, which is essentially a 1D convolution with the number of filters set by the size of that layer, in this case 16. This is followed by one relu activation and a squashing to a single output node. This model gets 0.897 ROC-AUC. One way to think about this is as using 3200 trees (200 features times 16 units) with a max depth of 1. Another way to think of it is that it performs a linear transformation on the distributions of the features, essentially trying to push the distribution of each feature for class 0 as far as possible from that for class 1. Using ReLU, the model can use 0 as the cutoff point. But since each feature seemed quite independent, one could have a separate linear transformation for each feature, so 200 models, one per feature, that are then combined. Someone ended up doing this successfully, but I did not have the time for it.
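The effect of the input shape can be checked with plain NumPy, without training anything. The sketch below mimics the forward pass of a dense layer with 16 units on an input of shape (200, 1), followed by a flatten step and a single sigmoid output. The flatten step and the random weights are my assumptions for illustration; this is not the kernel's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_features, n_units = 4, 200, 16

x = rng.normal(size=(batch, n_features, 1))   # same layout as Input(shape=(200, 1))
W = rng.normal(size=(1, n_units))             # one small kernel, shared by all features
b = rng.normal(size=n_units)

# A dense layer acts on the last axis, so every feature value is pushed
# through the same 16 units on its own, with no mixing between features.
hidden = np.maximum(x @ W + b, 0.0)           # relu, shape (batch, 200, 16)

# Flatten to 200 * 16 = 3200 values per row, then squash to one output node.
flat = hidden.reshape(batch, -1)
w_out = rng.normal(size=flat.shape[1])
prob = 1.0 / (1.0 + np.exp(-(flat @ w_out) / flat.shape[1]))

print(hidden.shape, flat.shape, prob.shape)
```

With Input(shape=(200,)) instead, the first dense layer would have a 200 by 16 weight matrix that mixes all the features together, which is exactly what does not seem to help on this data.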
Taking a break
I had been Kaggling every day and needed a break. I had homework in other classes to catch up on and other things outside of school to do. This was essentially the halfway point, and no one had really taken the lead, so I decided to stop Kaggling for about 2 weeks.
Back to Kaggle
I come back to Kaggle to see people have broken 0.91. There are lots of posts about what people have been trying, with tons of different “insight” into the data, some useful, some not. With only a couple of weeks to go, I figure I’ll try stacking my classifiers, because with my lack of intuition I can’t tell which insight is useful, and people usually stack on Kaggle to eke out a better score. Because this was my first competition, I didn’t really have the confidence to post any public kernels or add comments to a discussion, but in my next competition I will.
What is stacking and why stack? These were things I didn’t know before the competition, but concepts I learned and tried to apply with varying degrees of failure. Stacking is essentially combining multiple models together to make a model more powerful than any of them individually. There are some good explanations online that I won’t repeat for now: http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
So I tried various ways of stacking with 2-3 levels and 3-100 models at each level, with one model at the end to get the final output. I also tried stacking with nested cross validation, which was a bit tricky and time-consuming. I mostly used XGB, LGBM, Catboost, GNBs, and neural networks. I manually did some data sampling and feature sampling for each of the models, despite the fact that the gradient boosted tree methods already have parameters for this. I had heard this was a way to further diversify the models, which is a goal for the models within the same level. For the one model at the end I tried various options, including a simple logistic regression. But with only a couple of weeks left, I didn’t really have time to optimize parameters for this pipeline, so I ended up using the best parameters found previously. That was not ideal, and it’s what I get for not Kaggling for 2 weeks, but it did improve the score by 0.01 to 0.901.
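For readers new to stacking, here is a minimal two-level sketch with Sklearn. To keep it short and fast, it swaps in a Gaussian Naive Bayes and a shallow decision tree as base models in place of the boosted trees and neural networks I used, and it runs on synthetic data, so treat it as an outline of the idea and not my actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: about 10% positives, 20 independent features.
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.1).astype(int)
X = rng.normal(size=(2000, 20)) + 0.3 * y[:, None]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
base_models = [GaussianNB(), DecisionTreeClassifier(max_depth=3, random_state=0)]

# Level 1: out-of-fold predictions, so each row is scored by a model
# that never saw that row during training.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in base_models
])

# Level 2: a simple logistic regression learns how to combine the base models.
meta_pred = cross_val_predict(
    LogisticRegression(), oof, y, cv=cv, method="predict_proba"
)[:, 1]
print(f"stacked ROC-AUC: {roc_auc_score(y, meta_pred):.3f}")
```

The important detail is that the second level only ever sees out-of-fold predictions; feeding it predictions made on the rows the base models were trained on would leak the labels.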
Last week of Kaggle
I considered using the manually created features such as min, max, etc., but people said they weren’t helping. I knew the “magic” was hidden in the kernels and discussions, but I also knew I wouldn’t place in the top 5, so at this point I was just sort of chilling and having fun reading what other people were doing, although I should’ve been feature engineering.
Kaggle 2019 Santander Conclusion
The winners are verified! Time to take a look at what they did! Wow, very amazing stuff, a lot better than what I did! I will maybe write about it in another post.
Conclusions and what I learned
I learned about new ML algorithms such as LGBM and Catboost, and tried out an LSTM for the first time, having only learned about them before. I also learned how to mix models together through stacking and how to use nested cross-validation. Most importantly, I learned the importance of feature engineering in Kaggle competitions. Even when we don’t know what the features are, which makes feature engineering harder, we must still attempt it in order to win. Thank you for reading this post and I hope you enjoyed it!