Is there a general format for labeled (categorical) inputs in scikit-learn datasets? I see that datasets have a list of labels for the output in target_names. I want to follow scikit-learn conventions and keep some data about the labels of input variables (e.g. sex). Is there already a convention for this? Something like this:
>>> data_set.inputs["sex"]
{'male': 1, 'female': 0}
There is no convention for storing categorical feature name information. You are free to do as you wish.
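For example, one way to do this is to attach the mapping to the same kind of container the bundled dataset loaders return. This is only a sketch: the `inputs` attribute is the hypothetical name from the question, not something scikit-learn defines, and the data values are made up for illustration.

```python
import numpy as np
from sklearn.utils import Bunch

# `inputs` is a hypothetical attribute (taken from the question), not a
# scikit-learn convention. Bunch accepts arbitrary keyword arguments.
data_set = Bunch(
    data=np.array([[1, 23.0], [0, 31.0], [0, 45.0]]),  # made-up rows: sex, age
    target=np.array([0, 1, 1]),
    feature_names=["sex", "age"],
    target_names=["no", "yes"],
    inputs={"sex": {"male": 1, "female": 0}},
)

# Invert the stored mapping to decode the integer-coded "sex" column.
sex_codes = {code: name for name, code in data_set.inputs["sex"].items()}
decoded = [sex_codes[int(code)] for code in data_set.data[:, 0]]
```

Nothing in scikit-learn reads `inputs`; it is only carried along for your own code to look up.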
Alternatively, you can just store the original data in its original format and apply DictVectorizer / FeatureHasher and LabelBinarizer on the fly when you need to build a model from the data.
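A minimal sketch of that approach, assuming the original data is kept as a list of dicts. The records and labels here are made up, and using LabelBinarizer on the target labels is just one possible way to apply it.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelBinarizer

# Made-up records kept in their original, human-readable form.
records = [
    {"sex": "male", "age": 23},
    {"sex": "female", "age": 31},
    {"sex": "female", "age": 45},
]
labels = ["no", "yes", "yes"]

# String-valued features are one-hot encoded; numeric ones pass through.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

# Binary labels become a single 0/1 column.
binarizer = LabelBinarizer()
y = binarizer.fit_transform(labels)

# The fitted objects remember the names, so no separate mapping is needed.
feature_names = vectorizer.feature_names_
class_names = binarizer.classes_
```

With this approach the label information lives in the fitted transformers (`feature_names_` holds column names such as `sex=male`, `classes_` holds the target labels) rather than in the dataset object itself.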