datasetFunctions
phs.datasetFunctions
Functions:
Name | Description |
---|---|
loadData | Load the four datasets |
readDataFromZipArchive | Read a pandas DataFrame from an Excel file |
sanitiseData | Sanitise raw data |
saveData | Save the four datasets with filenames based on format `fnamesFmt` |
splitData | Split `dataClean` into trainVal and test sets |
Attributes:
Name | Type | Description |
---|---|---|
fnamesFmt | | |
fnamesFmt = ['{prefix}/xTrainVal.txt', '{prefix}/yTrainVal.txt', '{prefix}/xTest.txt', '{prefix}/yTest.txt']
module-attribute
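Each entry of `fnamesFmt` contains a `{prefix}` placeholder, which suggests the templates are expanded with `str.format`. A minimal sketch of that expansion, assuming a hypothetical prefix `data/run1` (the prefix value is not from the original documentation):

```python
# Copied from the module attribute documented above.
fnamesFmt = ['{prefix}/xTrainVal.txt', '{prefix}/yTrainVal.txt',
             '{prefix}/xTest.txt', '{prefix}/yTest.txt']

# 'data/run1' is a hypothetical prefix chosen for illustration only.
paths = [fmt.format(prefix='data/run1') for fmt in fnamesFmt]

for path in paths:
    print(path)
```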
loadData(prefix)
Load the four datasets `xTrainVal`, `yTrainVal`, `xTest`, `yTest` from filenames based on format `fnamesFmt`.
The datasets are converted to NumPy arrays.
readDataFromZipArchive(archive, fname)
Read a pandas DataFrame from the Excel file `FNAME` within `ARCHIVE`.
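The documentation does not show how the archive is opened. One plausible way to read an Excel member from a zip archive is sketched below, using the standard `zipfile` module and `pandas.read_excel`. The function name `read_excel_from_zip` is hypothetical, and the sketch assumes an Excel engine such as `openpyxl` is installed.

```python
import io
import zipfile

import pandas as pd


def read_excel_from_zip(archive, fname):
    """Hypothetical stand-in for readDataFromZipArchive (not the original code).

    archive: path or file-like object of a zip archive.
    fname:   name of the Excel member inside the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        # Read the member fully into memory so pandas gets a seekable buffer.
        payload = zf.read(fname)
    return pd.read_excel(io.BytesIO(payload))
```

Reading the member into a `BytesIO` buffer avoids extracting the file to disk, at the cost of holding it in memory.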
sanitiseData(dataRaw)
Sanitise raw data (an instance of `pandas.DataFrame`):
- Coerce Errors.
- Drop columns with more than 50% missing values.
- Drop rows with NA.
- Retrieve \(Y\) and \(X\) as numpy arrays.
- Transform the \(Y\) labels so that \(Y \in \{-1,1\}\) instead of \(\{0,1\}\).
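The listed steps can be sketched with pandas and NumPy as follows. This is an illustration under assumptions, not the module's actual code: the function name `sanitise_sketch` and the `label_col` parameter are hypothetical, and "coerce errors" is interpreted here as `pandas.to_numeric(..., errors="coerce")`, which turns unparseable cells into NaN.

```python
import numpy as np
import pandas as pd


def sanitise_sketch(data_raw, label_col):
    """Hypothetical re-creation of the listed steps; label_col is an assumed parameter."""
    # 1. Coerce errors: cells that cannot be parsed as numbers become NaN.
    data = data_raw.apply(pd.to_numeric, errors="coerce")
    # 2. Drop columns with more than 50% missing values.
    data = data.loc[:, data.isna().mean() <= 0.5]
    # 3. Drop rows that still contain NA.
    data = data.dropna(axis=0)
    # 4. Retrieve Y and X as numpy arrays.
    y = data[label_col].to_numpy()
    x = data.drop(columns=label_col).to_numpy()
    # 5. Map labels from {0, 1} to {-1, 1}.
    y = 2 * y - 1
    return x, y


# Tiny made-up frame for illustration only.
raw = pd.DataFrame({
    "f1": [1.0, "bad", 3.0, 4.0],
    "f2": [None, None, None, 5.0],
    "label": [0, 1, 1, 0],
})
x, y = sanitise_sketch(raw, label_col="label")
print(x.shape, y)
```

In this toy frame, column `f2` is 75% missing and is dropped, and the row containing `"bad"` is removed after coercion.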
saveData(prefix, xTrainVal, yTrainVal, xTest, yTest)
Save the four datasets with filenames based on format `fnamesFmt`.
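A save and load round trip over the four filename templates might look like the sketch below. The original does not state which I/O routines are used; `numpy.savetxt` and `numpy.loadtxt` are assumptions that fit the `.txt` extensions, and `save_sketch` and `load_sketch` are hypothetical names. Creating the prefix directory is also an assumption.

```python
import os
import tempfile

import numpy as np

# Copied from the documented module attribute.
fnamesFmt = ['{prefix}/xTrainVal.txt', '{prefix}/yTrainVal.txt',
             '{prefix}/xTest.txt', '{prefix}/yTest.txt']


def save_sketch(prefix, xTrainVal, yTrainVal, xTest, yTest):
    """Hypothetical counterpart of saveData: one text file per dataset."""
    os.makedirs(prefix, exist_ok=True)  # assumption: the directory may not exist yet
    for fmt, arr in zip(fnamesFmt, (xTrainVal, yTrainVal, xTest, yTest)):
        np.savetxt(fmt.format(prefix=prefix), arr)


def load_sketch(prefix):
    """Hypothetical counterpart of loadData: returns four numpy arrays."""
    return tuple(np.loadtxt(fmt.format(prefix=prefix)) for fmt in fnamesFmt)


# Round trip in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "run1")
    x_tv = np.arange(8.0).reshape(4, 2)
    y_tv = np.array([1.0, -1.0, 1.0, -1.0])
    x_te = np.arange(4.0).reshape(2, 2)
    y_te = np.array([-1.0, 1.0])
    save_sketch(prefix, x_tv, y_tv, x_te, y_te)
    loaded = load_sketch(prefix)
```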
splitData(dataClean, trainValFactor=0.8)
Split `dataClean` into trainVal and test sets.
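The documentation does not describe the structure of `dataClean` or whether the split shuffles the samples. The sketch below assumes `trainValFactor` is the fraction of samples assigned to the trainVal set, takes separate `x` and `y` arrays, and shuffles with a fixed seed. The name `split_sketch` and the `seed` parameter are hypothetical.

```python
import numpy as np


def split_sketch(x, y, trainValFactor=0.8, seed=0):
    """Hypothetical split: the first trainValFactor share of a shuffled order goes to trainVal."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))            # shuffled sample indices
    n_train_val = int(trainValFactor * len(x))
    tv, te = order[:n_train_val], order[n_train_val:]
    return x[tv], y[tv], x[te], y[te]


# Made-up data: 10 samples, 2 features, labels in {-1, 1}.
x = np.arange(20.0).reshape(10, 2)
y = np.array([1, -1] * 5)
xTrainVal, yTrainVal, xTest, yTest = split_sketch(x, y)
print(xTrainVal.shape, xTest.shape)
```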