It is an oft-repeated maxim that "ML is only as good as the data that it uses." For this reason, it is important to know not only where to look for data, but also what questions to ask when you find it.
In the social sector, accessing high-quality datasets can be challenging. At the same time, not all ML datasets consist of hundreds of thousands of examples. In fact, you may be able to use a small dataset and leverage existing “knowledge” from a mature ML model, either by reusing parts of existing models (e.g., from TensorFlow Hub) or through transfer learning technology such as Cloud AutoML. In addition, organizations may have access to more useful data than they realize, whether proprietary data (e.g., historical evaluation forms) or datasets in the public domain (more on that below). You may also be able to find collaborators who are willing to share data for social good applications.
Wherever you find your data, consider the following questions before you use the dataset to train your ML system:
- Am I allowed to use the dataset? What are the licensing requirements? What are the policies or regulations (if any) governing use of this data? Will I be able to use the model trained on this data for my purposes (e.g., is commercial use allowed)?
- How well does this data represent the content I am trying to predict? Are there any biases in the data? Will use of this data lead to any unfair biases in my system? Is there enough metadata to understand the possible biases in the way that it was collected? Knowing that most datasets will never be perfect, what policies or constraints can I build into my model to mitigate unfair bias?
- What is the source of the data? Is the data source reliable?
- How was the data collected?
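As a rough illustration of the representation question above, the sketch below tallies how much of a dataset each group makes up and how often each group carries a positive label. It is a minimal starting point, not a full fairness audit. The function name `summarize_by_group`, the field names `region` and `approved`, and the toy records are all hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict


def summarize_by_group(rows, group_key, label_key):
    """Return {group: (share_of_dataset, positive_label_rate)}."""
    counts = Counter()
    positives = defaultdict(int)
    for row in rows:
        group = row[group_key]
        counts[group] += 1
        if row[label_key] == 1:
            positives[group] += 1
    total = sum(counts.values())
    return {
        group: (counts[group] / total, positives[group] / counts[group])
        for group in counts
    }


# Hypothetical toy records; "region" and "approved" are illustrative names.
rows = [
    {"region": "urban", "approved": 1},
    {"region": "urban", "approved": 1},
    {"region": "urban", "approved": 0},
    {"region": "rural", "approved": 0},
]

summary = summarize_by_group(rows, "region", "approved")
for group, (share, positive_rate) in summary.items():
    print(f"{group}: {share:.0%} of rows, {positive_rate:.0%} positive")
```

A group that is barely present, or whose label rate differs sharply from the others, is a prompt to look at how the data was collected before training on it. It is not proof of bias by itself.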
After answering these questions, consider whether the data has the features that you need for training and whether it is diverse and complete enough to help you answer your ML questions. The Data Preparation and Feature Engineering course below covers those topics in more detail.
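One simple way to get a first read on completeness is to measure how often each field is missing. The sketch below assumes records are stored as plain dictionaries and treats an absent key, `None`, or an empty string as missing. The helper `missing_rate_by_field` and the sample fields (`age`, `income`, `district`) are hypothetical.

```python
def missing_rate_by_field(rows, fields):
    """Return {field: fraction of rows where the field is absent, None, or empty}."""
    if not rows:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        rates[field] = missing / len(rows)
    return rates


# Hypothetical records with a few gaps, for illustration only.
records = [
    {"age": 34, "income": 52000, "district": "north"},
    {"age": None, "income": 48000, "district": "south"},
    {"age": 29, "income": "", "district": "north"},
    {"age": 41, "district": "east"},
]

rates = missing_rate_by_field(records, ["age", "income", "district"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

Fields with a high missing rate may need to be dropped, imputed, or collected again, depending on how important they are to the question you are trying to answer.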