
datasetFunctions

phs.datasetFunctions

Functions:

Name                      Description
loadData                  Load the four datasets.
readDataFromZipArchive    Read a pandas DataFrame from the Excel file fname within archive.
sanitiseData              Sanitise the raw data (a pandas.DataFrame).
saveData                  Save the four datasets with filenames based on format fnamesFmt.
splitData                 Split dataClean into trainVal and test sets.

Attributes:

Name         Type         Description
fnamesFmt    list[str]    Filename format strings for the four datasets, each with a {prefix} placeholder.

fnamesFmt = ['{prefix}/xTrainVal.txt', '{prefix}/yTrainVal.txt', '{prefix}/xTest.txt', '{prefix}/yTest.txt'] module-attribute

loadData(prefix)

Load the four datasets xTrainVal,yTrainVal,xTest,yTest from filenames based on format fnamesFmt.

The loaded datasets are converted to NumPy arrays.
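
A minimal sketch of how such a loader might look. This page does not state the file format, so the sketch assumes the four files are plain-text arrays readable by numpy.loadtxt. loadDataSketch is a hypothetical stand-in, not the module's implementation, and FNAMES_FMT is copied from fnamesFmt above.

```python
import numpy as np

# Copied from the module attribute fnamesFmt documented above.
FNAMES_FMT = [
    '{prefix}/xTrainVal.txt',
    '{prefix}/yTrainVal.txt',
    '{prefix}/xTest.txt',
    '{prefix}/yTest.txt',
]


def loadDataSketch(prefix):
    """Hypothetical loader: read the four files as NumPy arrays.

    Assumption: each file is whitespace-delimited text, as written by
    numpy.savetxt. The real file format is not documented here.
    """
    # Fill the {prefix} placeholder, then load each file in order.
    return tuple(np.loadtxt(fmt.format(prefix=prefix)) for fmt in FNAMES_FMT)
```

The result unpacks in the documented order, for example xTrainVal, yTrainVal, xTest, yTest = loadDataSketch("data"), where "data" is an illustrative directory name.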

readDataFromZipArchive(archive, fname)

Read a pandas DataFrame from the Excel file fname contained in the zip archive archive.
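
One way this could be done with the standard zipfile module and pandas.read_excel. The sketch assumes archive is a path to a zip file and fname is the member name inside it; neither is confirmed by this page. readDataFromZipArchiveSketch is a hypothetical name, and pandas needs an Excel engine such as openpyxl installed to parse the file.

```python
import io
import zipfile

import pandas as pd


def readDataFromZipArchiveSketch(archive, fname):
    """Hypothetical reader: load one Excel member of a zip archive.

    Assumptions: `archive` is a path (or file-like object) for a zip
    file and `fname` is the name of an Excel file stored inside it.
    """
    with zipfile.ZipFile(archive) as zf:
        # Read the member fully into memory so pandas gets a seekable buffer.
        payload = zf.read(fname)
    return pd.read_excel(io.BytesIO(payload))
```

Reading the member into a BytesIO buffer avoids depending on whether the zip member stream supports seeking.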

sanitiseData(dataRaw)

Sanitise the raw data dataRaw, a pandas.DataFrame.

  1. Coerce errors.
  2. Drop columns with more than 50% missing values.
  3. Drop rows with missing (NA) values.
  4. Retrieve \(Y\) and \(X\) as NumPy arrays.
  5. Transform \(Y\) so that \(Y \in \{-1,1\}\) instead of \(\{0,1\}\).
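
The steps above can be sketched roughly as follows. Several details are assumptions, not taken from the module: "coerce errors" is read as pandas.to_numeric with errors="coerce", the label column is passed in through a hypothetical labelCol parameter, and the label map is taken to be \(Y \mapsto 2Y - 1\).

```python
import pandas as pd


def sanitiseDataSketch(dataRaw, labelCol):
    """Hypothetical version of the five sanitising steps.

    `labelCol` is an illustrative parameter; the real function takes
    only dataRaw and its way of locating Y is not documented here.
    """
    # 1. Coerce errors: values that cannot be parsed as numbers become NaN.
    data = dataRaw.apply(pd.to_numeric, errors="coerce")
    # 2. Drop columns with more than 50% missing values.
    data = data.loc[:, data.isna().mean() <= 0.5]
    # 3. Drop rows that still contain a missing value.
    data = data.dropna()
    # 4. Retrieve Y and X as NumPy arrays.
    y = data[labelCol].to_numpy()
    x = data.drop(columns=[labelCol]).to_numpy()
    # 5. Map labels from {0, 1} to {-1, 1}.
    y = 2 * y - 1
    return x, y
```

With this reading, a label of 0 becomes -1 and a label of 1 stays 1.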

saveData(prefix, xTrainVal, yTrainVal, xTest, yTest)

Save the four datasets with filenames based on format fnamesFmt.
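
A minimal sketch of the saving side, assuming plain-text output via numpy.savetxt (the actual on-disk format is not stated on this page). saveDataSketch is a hypothetical name, FNAMES_FMT is copied from fnamesFmt, and creating the prefix directory is an added convenience that the real function may not perform.

```python
import os

import numpy as np

# Copied from the module attribute fnamesFmt documented above.
FNAMES_FMT = [
    '{prefix}/xTrainVal.txt',
    '{prefix}/yTrainVal.txt',
    '{prefix}/xTest.txt',
    '{prefix}/yTest.txt',
]


def saveDataSketch(prefix, xTrainVal, yTrainVal, xTest, yTest):
    """Hypothetical saver: write each array to its formatted filename."""
    # Assumption: create the target directory if it does not exist yet.
    os.makedirs(prefix, exist_ok=True)
    arrays = (xTrainVal, yTrainVal, xTest, yTest)
    # Pair each filename template with its array, in the documented order.
    for fmt, arr in zip(FNAMES_FMT, arrays):
        np.savetxt(fmt.format(prefix=prefix), arr)
```

Keeping the argument order identical to the order of fnamesFmt is what ties each array to its filename.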

splitData(dataClean, trainValFactor=0.8)

Split dataClean into trainVal and test sets.
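
A rough sketch of such a split. The structure of dataClean is not documented on this page, so the sketch takes \(X\) and \(Y\) as two separate arrays; it also assumes that trainValFactor is the fraction of rows assigned to trainVal and that rows are split in order without shuffling. splitDataSketch is a hypothetical name.

```python
import numpy as np


def splitDataSketch(x, y, trainValFactor=0.8):
    """Hypothetical split: the first fraction of rows goes to trainVal.

    Assumptions: no shuffling, and trainValFactor is the trainVal share.
    """
    # Number of rows assigned to the trainVal set.
    nTrainVal = int(trainValFactor * len(x))
    xTrainVal, xTest = x[:nTrainVal], x[nTrainVal:]
    yTrainVal, yTest = y[:nTrainVal], y[nTrainVal:]
    return xTrainVal, yTrainVal, xTest, yTest
```

With 10 rows and the default factor, this gives 8 rows for trainVal and 2 for test.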