One hot encoding
Say we have a categorical variable in our dataset such as occupation:
To pass these into a machine learning algorithm we need to convert them to numeric values. We could just assign a numeric label, for example teacher = 1, office worker =2 and so on. But some algorithms could interpret this as office worker > teacher which makes no sense. One way to get around this problem is to use one hot encoding. Then we create vectors to identify the categorical value, for example
Teacher = 100
Office worker = 010
Accountant = 001
But one-hot vectors would increase the dimensionality however it is possible to use some dimensionality reduction like PCA. Note that if our categorical variable has many unique values then, we may want to use a more sparse encoding.