in-your-own-words t1_irmv7ca wrote on October 9, 2022 at 2:02 PM

Reply to comment by Street_Excitement_14 in [D] CSV File to training and testing split by redditnit21

My pedagogical method is more Socratic, and it comes from an engineering perspective. I think learning how to find the names of the mainstream tools, how to find their documentation, and how to read, understand, and rely on it is ultimately what benefits and empowers a developer most.

in-your-own-words t1_irmnurg wrote on October 9, 2022 at 12:59 PM

Reply to comment by redditnit21 in [D] CSV File to training and testing split by redditnit21

Yes, there are dozens of ways of doing it. I encourage you to figure it out yourself. If you can't design and implement your own test & evaluation (T&E) experiments for ML, you will end up doing the world more harm than good by dabbling in it. The entire ML field suffers from extremely weak T&E, and lots of people only ever learn to stuff inputs into functions.

Some hints:

There may be functions within standard machine learning software libraries that produce train/test splits given tabular data input.
There may be functions that will produce a random permutation of rows of a table.
There may be functions that produce random permutations of the numbers from 0 to N-1, where you specify N. If N is the number of rows in your table, you could create a new column of these random numbers and then sort the table on that column.
You may want to consider class imbalance in your dataset. If your classes are imbalanced, apply your train/test split independently to class 1 and class 0 so that the train and test partitions both contain the same proportion of 1s and 0s.
Consider using an outer cross-validation approach, where you run your experiment for k different train/test splits. When you report your metrics, look at the distribution of each metric over the k trials. Report the median, interquartile range, 5th and 95th percentiles, and outliers for each metric.
Version control your code and tag the commit that produces the results you report. Include this tag or the commit hash with your reporting of results.
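
For anyone who wants to check their own attempt against something, here is a minimal sketch of the per-class split and the k repeated splits from the hints above, using only the Python standard library. The function name `stratified_split`, the `label_of` argument, the 80/20 fraction, and the toy table are all assumptions made up for illustration, not anything from the original question. Mainstream ML libraries ship their own versions of this, and finding and reading their documentation is still the better exercise.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=0):
    """Split rows into (train, test), splitting each class independently."""
    rng = random.Random(seed)  # seeded so the split is reproducible

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_class):
        members = list(by_class[label])
        rng.shuffle(members)  # random permutation within the class
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])

    # Shuffle again so the classes are not grouped together in the output.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical toy table of (feature, label): 80 rows of class 0, 20 of class 1.
rows = [(i, 0) for i in range(80)] + [(i, 1) for i in range(80, 100)]

# k = 5 different splits, one per seed, as in the repeated-experiment hint.
splits = [stratified_split(rows, lambda r: r[1], 0.8, seed=k) for k in range(5)]
train, test = splits[0]
```

Each of the k splits would then feed one run of your experiment, and you would summarize each metric across the k runs rather than reporting a single number.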

in-your-own-words t1_irmn9d4 wrote on October 9, 2022 at 12:54 PM

Reply to [D] CSV File to training and testing split by redditnit21

Randomly permute the rows of the table, then take the first X% of them for training and the rest for testing.
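
A rough sketch of that, assuming the CSV has a header row and fits in memory. The helper name `permute_and_split`, the 80% fraction, the seed, and the inline CSV text (standing in for your real file) are placeholders I made up for illustration.

```python
import csv
import io
import random


def permute_and_split(rows, train_fraction=0.8, seed=0):
    """Randomly permute rows, then cut at train_fraction."""
    shuffled = list(rows)                   # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical CSV content; with a real file you would use open(path, newline="").
csv_text = "x,y\n" + "\n".join(f"{i},{i % 2}" for i in range(10))

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)   # keep the header out of the shuffle
data = list(reader)

train_rows, test_rows = permute_and_split(data, train_fraction=0.8, seed=42)
```

Fixing the seed means the same split comes back on every run, which matters if you want the results you report to be reproducible.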