Data Collection and Preprocessing

January 28, 2018

Transcript of the video

Accio Data!

Data collection is one of the most important parts of building machine learning models, because no matter how well designed our model is, it won't learn anything useful if the training data is invalid.

It's garbage-in, garbage-out! Invalid data lead to invalid results. This is not to say that the training data needs to be perfect. Especially if you have millions of samples, it's inevitable that some samples will have inaccurate labels. That can be tolerable. But if there is a serious flaw in our data collection strategy, we might end up with data that is complete garbage.

'Garbage' can mean several things here. For example, the labels can be inaccurate, or the input variables can be inaccurate or missing. These problems are usually obvious and easy to notice after a quick inspection.

A sneakier type of flaw is dataset bias. When people, or companies, use data-driven models for decision making, they sometimes assume that the decisions cannot be biased since it's the machines that make them, and they are based on 'big data'. The problem is that the data can be biased, no matter how big it is. You might have heard about Tay, a chatbot that started to post racist tweets after being left unsupervised for a while. It's highly unlikely that that behavior was intentional, but human biases sometimes surface in models trained on data collected from humans. Racism in, racism out. Sexism in, sexism out. One should be extra careful when building models that affect people's lives, such as models used for medical, financial, or legal purposes. Using biased data to make decisions can further reinforce the biases and lead to unfair discrimination.

Ok, that being said, let's talk about where to find data and what to do with them. Let's start with the cheapest and easiest option. If you are lucky, you might not have to collect any data at all. There are many datasets that are freely available on the web. Unless you want to build a model that focuses on a niche topic, it's likely that you will find the dataset you want by doing a simple web search.

Let's move to the next cheapest option: web crawling and scraping. The internet is an immense source of information. For example, the entire Wikipedia is available for download. Content available on Wikipedia can be used either as-is or as a starting point to collect more data. Natural Language Processing models, in particular, can benefit greatly from this data as a large corpus of text in a given domain. Other applications, such as image classifiers, can also benefit from data available on Wikipedia. For example, if your goal is to classify dog breeds, you can pull a list of dog breeds from Wikipedia, iterate over its rows, and find and download pictures of each breed on the web. Some content providers make this easier by providing an official API that gives programmatic access to the content. Web crawling and scraping tools, such as the Selenium WebDriver and a headless browser, can also be used to build such custom datasets. One thing to keep in mind about web scraping: make sure that you comply with the terms of service of the websites you scrape content from. Also, make sure to add delays between web requests to avoid putting too much load on the servers that you pull data from.
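
To make this concrete, here is a minimal sketch of polite scraping. The breed list is hand-written and the image-search endpoint and its response format are hypothetical placeholders; swap in whatever API or website you actually use, and keep the delay between requests.

```python
import time
import requests

# Hypothetical list of dog breeds; in practice this could be pulled from
# a Wikipedia table or an official API.
breeds = ["Labrador Retriever", "Beagle", "Border Collie"]

# Hypothetical image-search endpoint; substitute the service you actually use.
SEARCH_URL = "https://example.com/api/image-search"

for breed in breeds:
    response = requests.get(SEARCH_URL, params={"q": breed + " dog"})
    response.raise_for_status()
    # Assume the response contains a list of image URLs under "results".
    for i, image_url in enumerate(response.json().get("results", [])):
        image = requests.get(image_url)
        with open(f"{breed}_{i}.jpg", "wb") as f:
            f.write(image.content)
        time.sleep(1.0)  # be polite: add a delay between requests
```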

In some cases, these 'easy' data collection techniques might not be applicable, particularly if our task needs some subjective human judgment. For example, if our goal is to predict how a human would rate the quality of a given image, then we might need a set of images that are rated by humans. Actually, for this specific example, there are publicly available datasets. But my point is that at some point we might need to ask people to provide us some data. One way of doing that is to conduct surveys. If you can structure your study in the form of a game, that's even better. What would you prefer: to play a game or to fill out a survey? Games are a great way to collect data. You can also utilize crowdsourcing platforms such as Amazon Mechanical Turk. In any case, if you need human input to build a dataset, take care to ask unbiased questions, make the user interface easy to use, and make it fun for the subjects. More importantly, ensure that the entire process is ethical, and be extra cautious if you are dealing with sensitive data.

Now that we know where to find data, let's assume that we already have the data and talk about how to make it ready for machine consumption. We mentioned missing data earlier. It's best to avoid missing values during data collection, but that might not always be an option. So what do we do if we have some missing values here and there? If only a small portion of the samples have missing values, we can simply discard those samples. If particular data fields have a lot of missing values, we can drop those columns.
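
As a rough illustration, here is how both options might look with pandas; the file name and the 50% threshold are just placeholders for this sketch.

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical dataset

# Discard the rows that contain any missing value.
df_rows_dropped = df.dropna()

# Keep only the columns with at least 50% non-missing values; drop the rest.
df_cols_dropped = df.dropna(axis=1, thresh=len(df) // 2)
```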

Another option is data imputation. For time-series data, the last valid value is sometimes carried forward to fill in missing values. If this is a global stock market dataset, for example, there might be some missing values for companies in different countries, since the stock market might be closed due to national holidays. Other basic data imputation methods include substituting the missing values with the mean or the median of the column. Some more sophisticated methods predict the missing values from what's available by using another learned model. One caveat about missing values is that they might not always be random. A missing value can have a meaning on its own too. There might be a particular reason why some fields are empty, and filling in these fields might lead to a bias in the dataset. If it's a categorical variable that is missing some values, it's sometimes best to treat missing data as just another category.
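
Here is a small sketch of these imputation options in pandas, using made-up column names from the stock market example:

```python
import pandas as pd

df = pd.read_csv("stock_prices.csv")  # hypothetical time-series dataset

# Carry the last valid observation forward, e.g., across market holidays.
df["close_price"] = df["close_price"].ffill()

# Fill a numeric column with its median.
df["volume"] = df["volume"].fillna(df["volume"].median())

# Treat missing values in a categorical column as their own category.
df["sector"] = df["sector"].fillna("missing")
```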

Fixing a dirty dataset takes a lot of time. Cleaning a dataset usually involves manual inspection and corrections in addition to automated processes. If you are at the beginning of the data collection process, it is best to identify the underlying problems and revise the data collection strategy to prevent missing or inconsistent data in the first place when possible.

Having a clean dataset is not always enough to train useful models. There might still be some issues. For example, if the input variables are in different orders of magnitude, features with larger magnitudes can dominate features with smaller magnitudes during training. One solution is feature scaling, which is usually done to achieve consistency in the dynamic range of the variables. Scaling the variables properly can improve the results and speed up convergence. A very simple way to scale a variable is to map it to a specific range linearly. For example, we can scale all ages to the range [0, 1] by simply subtracting the minimum age in the dataset from the values and dividing the result by the difference between the maximum and minimum values. Another widely used method is standardization, which makes the variables zero mean and unit variance, preventing one variable with a large variance from dominating an objective function. This is simply done by subtracting the mean of each feature and dividing by its standard deviation.
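
Both transforms are one-liners; here is a minimal NumPy sketch using a made-up age column:

```python
import numpy as np

ages = np.array([18.0, 25.0, 40.0, 62.0, 75.0])  # made-up values

# Min-max scaling to the range [0, 1].
ages_minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: zero mean, unit variance.
ages_standardized = (ages - ages.mean()) / ages.std()
```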

One more thing that is worth mentioning is data imbalance. Certain classes in a dataset can sometimes have a much smaller number of samples than the others. As a result, the learning algorithm might end up ignoring those underrepresented classes. For example, if you have a dataset of animal species, the pictures of cats might vastly outnumber the pictures of tigers. The learning algorithm might then classify all felines as cats and get away with it if we use a uniform cost function.

In such a case, we have a few options. The first option is to leave it as-is. It might be acceptable to classify the samples that are less likely to be observed in the future with lower accuracy. If we think that all classes are equally important, one option is to undersample the larger classes by throwing out some of their samples. Personally, I don't recommend this. Another option is to oversample the underrepresented classes or to synthesize fake examples for these classes. I have found this technique useful in traditional machine learning systems, but I have never really used it for training deep models on large-scale data.
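
For completeness, here is what plain random oversampling might look like; the class sizes and feature dimensions are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: class 0 has 950 samples, class 1 has 50.
X = rng.normal(size=(1000, 8))
y = np.array([0] * 950 + [1] * 50)

# Randomly re-draw minority samples (with replacement) until the classes match.
minority_idx = np.where(y == 1)[0]
extra_idx = rng.choice(minority_idx, size=900, replace=True)

X_balanced = np.concatenate([X, X[extra_idx]])
y_balanced = np.concatenate([y, y[extra_idx]])
```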

Perhaps, a better option for deep learning models is to use a class-weighted cost function, where a higher cost is assigned to the misclassification of the underrepresented classes. These weights might either be hard-coded or dynamically adjusted based on observed frequencies of the samples. For example, we can assign a higher weight to axolotls to compensate for their rare occurrence in the dataset. Axolotls are awesome, by the way.
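
As an illustration, here is one common way to set such weights, inversely proportional to the observed class counts, and pass them to a PyTorch cross-entropy loss; the counts themselves are made up.

```python
import torch
import torch.nn as nn

# Hypothetical class counts from the training set; the last class
# (say, axolotls) is severely underrepresented.
class_counts = torch.tensor([5000.0, 3000.0, 40.0])

# Weight each class inversely proportional to its frequency.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

# Dummy logits and labels just to show the call.
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 2])
loss = criterion(logits, labels)
```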

The topics we covered today are usually discussed in the realm of data mining and often not discussed much in the context of deep learning. It's true that deep learning models can be robust against some sorts of noise and can still learn something useful from not-so-clean data. But, personally, I consider data collection and inspection one of the most important steps in building models. If a model doesn't work well, my experience is that the culprit is more likely to be the data than the hyperparameters.

That brings us to the end of this video, except for the bloopers at the end. In the next video, we are going to shift gears and finally talk about convolutional neural networks! Convolutional neural networks are amazing, and we will see why.

As always, thanks for watching, stay tuned, and see you next time.