Internal-Diet-514

Internal-Diet-514 t1_j6oujtg wrote

Time-series data would have a 3-dimensional shape: (number of series, number of time points, number of features). For gradient-boosted tree models, isn't the general approach to flatten that to (number of series, number of time points × number of features)? Whereas a CNN would be employed to extract time-dependent features before flattening the space.
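A minimal sketch of that flattening step in NumPy (the shapes here are made up for illustration):

```python
import numpy as np

# Hypothetical time-series batch: 8 series, 24 time steps, 3 features.
X = np.random.rand(8, 24, 3)

# Flatten to (number of series, time steps * features) so a
# gradient-boosted tree model can treat each series as one row.
X_flat = X.reshape(X.shape[0], -1)
print(X_flat.shape)  # (8, 72)
```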

If there are examples where boosted tree models perform better in this space, and I think you're right that there are, then I think that just goes to show that traditional machine learning isn't dead. Rather, if we could find ways to combine it with the thing that makes deep learning work so well (feature extraction), it'd probably do even better.

−4

Internal-Diet-514 t1_j6oi9qu wrote

If we're considering the dimensions to be the number of datapoints in an input, then I'll stick to that definition and use the shape of the data instead of dimensions. I don't think I was wrong to use dimensions to describe the shape of the data, but I get that it could be confusing, because "high-dimensional data" is usually synonymous with a large number of features, whereas I meant data with shape > 2.

Deep learning models like CNNs are great because of their ability to extract meaningful features from data with shape > 2 and then pass that representation to an MLP. But the feature extraction phase is a different task from what traditional ML is meant to do, which is to take a set of already-derived features and learn a decision boundary. So I'm trying to say a traditional ML model is not really comparable to the convolutional portion (the feature extraction phase) of a CNN.

−3

Internal-Diet-514 t1_j6nzvcc wrote

When talking about dimensions, I meant (number of rows, number of features) is 2 dimensions for tabular data, (number of series, number of time steps, number of features) is 3 dimensions for time series, and (number of images, width, height, channels) is 4 dimensions for image data. For deep learning classification, regardless of the number of dimensions the model originally ingests, the data will become (number of series, features) or (number of images, features) by the point we apply an MLP for classification.
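The shape convention above can be made concrete with NumPy (the sizes are arbitrary placeholders):

```python
import numpy as np

tabular = np.zeros((100, 5))          # (rows, features): 2 dimensions
series = np.zeros((100, 24, 5))       # (series, time steps, features): 3 dimensions
images = np.zeros((100, 32, 32, 3))   # (images, width, height, channels): 4 dimensions

print(tabular.ndim, series.ndim, images.ndim)  # 2 3 4
```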

You could consider an image to have width × height × channels features, but that's not what a CNN operates on; the CNN extracts meaningful features from that high-dimensional space. The feature extraction phase is what makes deep learning great for computer vision. Traditional ML models don't have that phase.

0

Internal-Diet-514 t1_j6nep37 wrote

Deep learning is only really the better option with higher-dimensional data. If tabular data is 2D, time series is 3D, and image data is 4D (extra dimension for batch), then deep learning is really only used for 3D and 4D data. As others have said, tree-based models will most of the time outperform deep learning on a 2D problem.

But I think the interesting thing is the reason we have to use deep learning in the first place. In higher-dimensional data we don't have something that is "a feature" in the sense that we do with 2D data. In time series you have features, but they are measured over time, so really we need a derived feature that describes how each raw feature behaves over time. That's what CNNs do. CNNs are feature extractors, and at the end of the process (when doing classification) they almost always put the data back into 2D format, which is sent through a neural net; but it could be sent through a random forest as well.
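A toy sketch of that idea: a fixed (not learned) convolutional feature-extraction step that collapses images into 2D features, which are then fed to a random forest. Everything here, the data, the filters, and the sizes, is made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy image batch: 40 grayscale 8x8 images, two classes.
X = rng.normal(size=(40, 8, 8))
y = np.repeat([0, 1], 20)
X[y == 1] += 1.0  # give class 1 a brighter mean so there is signal to find

# Stand-in "feature extraction": convolve with a few fixed 3x3 filters
# and global-average-pool, collapsing each image to a flat feature row.
filters = rng.normal(size=(4, 3, 3))

def extract(batch):
    feats = []
    for img in batch:
        windows = np.lib.stride_tricks.sliding_window_view(img, (3, 3))
        row = [(windows * f).sum(axis=(-1, -2)).mean() for f in filters]
        feats.append(row)
    return np.array(feats)  # shape: (n_images, n_filters)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(extract(X), y)
print(clf.score(extract(X), y))
```

Unlike a CNN trained end to end, the filters here are frozen, which is exactly the gap described below: there is no backpropagation path through the random forest to optimize them.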

I think it's fair to compare a neural network to traditional ML, but when we get into a CNN, that's not really a fair comparison. A CNN is a feature extraction method. The great thing is that we can optimize this step by connecting it to a neural network with a sigmoid (or whatever activation) output.

We don't have a way to connect traditional ML methods to a feature extraction method the way backpropagation connects a neural net to a CNN. If it's possible to find a way to do that, maybe we would see a rise in the use of traditional ML for high-dimensional data.

8

Internal-Diet-514 t1_j5vbxjl wrote

I would try without data augmentation first. You need a baseline to understand what helps and what doesn't to increase performance. If there is a strong signal that differentiates the classes, 100 images may be enough; the amount of data you need is problem-dependent, not one-size-fits-all. As others have said, make sure you're splitting into train and test sets to evaluate performance, and that each has a class distribution similar to the overall population (this matters if you have an imbalanced dataset). Keep the network lightweight if you're not using transfer learning, and build it up from there. At a certain point it will overfit, but overfitting will most likely happen faster the larger your network is.
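The split-with-matching-class-distribution advice is what scikit-learn's `stratify` argument does; a small sketch with made-up imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)  # imbalanced: 10% minority class

# stratify=y keeps the 90/10 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_te.mean())  # 0.1, matching the overall class balance
```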

2

Internal-Diet-514 t1_j07w3r6 wrote

Depends on the range of min and max values across every other sample in the dataset. For instance, if one of your samples ranges from 0–12 and most others range from 0–64, you would be missing out on the fact that 12 was actually a pretty low value compared to other observations, since per-sample scaling would map it to 1.
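A quick illustration of that pitfall, using the 0–12 vs 0–64 ranges from above:

```python
import numpy as np

a = np.array([0.0, 6.0, 12.0])   # this sample tops out at 12
b = np.array([0.0, 32.0, 64.0])  # most samples range up to 64

# Per-sample min-max scaling maps both maxima to 1.0, hiding the
# fact that 12 is small relative to the rest of the dataset.
scale = lambda x: (x - x.min()) / (x.max() - x.min())
print(scale(a)[-1], scale(b)[-1])  # 1.0 1.0

# Scaling with the dataset-wide min and max preserves that information.
lo, hi = 0.0, 64.0
print((a[-1] - lo) / (hi - lo))  # 0.1875
```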

1

Internal-Diet-514 t1_j07s3t2 wrote

I think that's why we have to be careful how we add complexity. The same model with more parameters will overfit quicker, because it can start to memorize the training set. But if we add complexity in the form of an ability to model more meaningful relationships in the data tied to the response, then I think overfitting would still happen, but we'd still get better validation performance. So maybe ViT for CIFAR-10 didn't add any additional capabilities that were worth it for the problem, just additional complexity.

1

Internal-Diet-514 t1_j07qmb0 wrote

I agree with you; it's just that nowadays, when people say they have created an architecture that outperforms some baseline, they really mean it outperforms some baseline on ImageNet or CIFAR or some other established dataset. All data is different, and I really think the focus should be: what added ability does this architecture have to model relationships in the input data that the baseline doesn't, and how does that help with this specific problem? That's why the transformer was such a great architecture for NLP problems to begin with: it demonstrated the ability to model longer-range dependencies than an LSTM-like architecture. I'm just not sure that advantage translated well to vision when we begin to say it's better than a pure CNN-based architecture.

5

Internal-Diet-514 t1_j07pfk6 wrote

On your first paragraph, when you say "given the same amount of data": isn't it shown here that the ViT was given more data, since it was pre-trained on other datasets before being fine-tuned on CIFAR-10, and then compared to other models that were most likely trained on CIFAR-10 alone? My worry is that a proper comparison between models requires them all to follow the same training procedure. You can reach SOTA performance on a dataset using techniques other than architecture alone.

2

Internal-Diet-514 t1_j073xp9 wrote

Stuff like that always makes me wonder. If they had to train it on several other datasets before training it on CIFAR-10, isn't it a worse architecture (for this specific problem) than one that performs well when trained from scratch on CIFAR-10? And if that model followed the same training procedure as the ViT, I wonder whether it would beat it.

5

Internal-Diet-514 t1_iymjci2 wrote

If a model has more parameters than datapoints in the training set, it can quickly just memorize the training set, resulting in an overfit model. You don't always need 16+ attention heads to have the best model for a given dataset. A single self-attention layer with one head still has the ability to model more complex relationships among the inputs than something like ARIMA would.
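For reference, a single-head scaled dot-product self-attention layer is only a few lines; this NumPy sketch uses random weights and made-up sizes (10 time steps, 4 features):

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over time steps."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over time steps (subtract the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

T, d = 10, 4  # 10 time steps, 4 features
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (10, 4)
```

Even this minimal head lets every time step attend to every other one, which is the kind of interaction a linear model like ARIMA can't express.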

2

Internal-Diet-514 t1_itwdhg2 wrote

To start, I'd down-sample the images that don't have any mass in them (or up-sample the ones with mass) in the training data, while keeping an even balance in the test/validation sets. Others have said above that the loss function is better suited to seeing an even representation. This is an easy way to do it without writing a custom data loader, and you can check whether that's the problem before diving deeper.
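A minimal index-based sketch of that down-sampling step (the 900/100 split is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 1 = image contains a mass, 0 = no mass.
y = np.array([0] * 900 + [1] * 100)

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

# Down-sample the majority class to match the minority count,
# then shuffle the combined training indices.
neg_sampled = rng.choice(neg, size=len(pos), replace=False)
train_idx = np.concatenate([pos, neg_sampled])
rng.shuffle(train_idx)

print(np.bincount(y[train_idx]))  # [100 100]
```

Because this only selects indices, it works with any image pipeline: just feed `train_idx` to your existing loader instead of writing a custom balanced one.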

2