Ethics in Data Collection

Berhane Cole
6 min read · Aug 16, 2021
Javier Velo

To conclude this multi-part examination of Artificial Intelligence and Machine Learning, we will look at its engine — data. Recently, I received a notification from an Artificial Intelligence and Data Solutions corporation about an opportunity to partake in a study. This study presented speakers of AAVE (African-American Vernacular English) with the task of recording 200 voice samples in order to, according to the release I received:

“capture a broad cross-section of participants targeting various combinations of demographics, with the goal of ensuring that our customer’s services, and derived products, are equally representing a diverse set of end-users.”

My interest was piqued by this mission, both as a software engineer and as a Black American. That interest crested on the same questions: what is the overall scope of this project, and what can be developed through this data collection? Naturally, I can imagine ways that models built on a diversity of dialects could be attractive to companies for a variety of customer service and customer experience applications, but more nefarious technologies could also be developed with these models, which, as a Black man and a developer, propelled both my interest and my concern.

Data collection, data categorization, data interpretation, data annotation, and many other permutations of data analysis are fundamental to the nature of machine learning. Taking this as fact raises the question of how developers get their hands on vast amounts of data. The above study represents a form of data collection that relies on a formal agreement between the data sourcing company and a compensated "supplier." Here the agreement and the compensation are straightforward. This is not always the case, and such clarity can be rare, which is why a thorough investigation of ethics and protections within data collection is warranted.

Trust

When anyone is online, they constantly exist as both a consumer and a product. Personal search history and other user metadata are interpreted to further advertise to the user, and as the user navigates, more information is gathered to continually feed this loop. What underlies this continuous dance is a set of agreements between end-users and corporations. Access to technology is often subsidized by the technology's access to its users.

Here we encounter the distinction between compliance and ethics. Compliance is a party's adherence to binding agreements. For example, HIPAA (the Health Insurance Portability and Accountability Act) governs the relationship between patients and healthcare providers, protecting the privacy of patients' medical records and mitigating unethical relationships between providers, insurance companies, pharmaceutical corporations, etc. A provider that follows these rules is compliant with the letter of HIPAA.

Compliance and ethics should not be conflated. In the above example, while a healthcare provider is unable to disclose patients' sensitive information, the USA's issue with excessive prescription of opioid medication and bias towards certain pharmaceuticals shows the opportunity for unscrupulous but compliant behavior. The danger this raises, besides the obvious societal impact, is the dwindling of consumer trust and safety. An example of this is the controversy over 23andMe selling genetic information to insurance companies, which can analyze it and adjust rates for consumers who have predispositions for certain ailments.

Facebook and Cambridge Analytica

Facebook is an obvious example of a corporation operating on a deficit of consumer trust, specifically relating to dubious decisions regarding data sourcing. The omnipresence and ambition of Facebook have exposed it to many privacy and ethics controversies, a lasting one being the Cambridge Analytica scandal. The germ of this issue was a million-dollar purchase of information from an academic affiliated with Cambridge University, information that included a Facebook personality quiz which allowed access to a vast amount of users' information.

Facebook is on the hook for its lack of transparency, and the aims and means of Cambridge Analytica forecast the possibilities of unethical data sourcing. As mentioned above, the internet user is both product and consumer. In order to use a platform like Facebook, a user enters into an agreement with Facebook to share their data and is compensated with access to its technology. There may be an inkling of understanding that the information provided will lead to advertisements or a UX personalized for the user.

A realistic assumption is that the information provided would be handled by Facebook and other opted-into services such as Amazon or Google. The crisis of Cambridge Analytica was that the personality data of 87 million users was obtained, through seemingly innocuous means, by a firm users had not given permission to. The ethics of Cambridge Analytica's usage of the dubiously sourced data is questionable, but the crisis lies in Facebook's failure to implement robust data protection and the societal dark patterns borne of big data run rampant.

Data Protection

Example of the sorts of information data brokerage firm Acxiom collects

Data protection and data privacy are different things. Data privacy deals with access to sensitive information, while data protection is concerned with policies that govern the use and lifecycle of such sensitive information. While machine learning is built on access to information, it is imperative that technological progress and ease-of-use do not overtake the wellbeing of users. The spirit of the Hippocratic oath has been summarized as 'do no harm.' Information technology, as it stands in today's landscape, is a fundamentally capitalist enterprise that cannot fully embrace such an oath, but the sheer amount of harm that can be done with big data should compel regulation and oversight.

The European Union's GDPR (General Data Protection Regulation) is among the most thorough documents relating to the protection of users' information. It details rules for accountability, reporting, clarity, and transparency, as well as fines for companies that do not comply with its regulations. Importantly, it gives users the right to see what of their information is tracked and how that information is used. While perhaps the only true defense, regulation still has its limits. The GDPR is only applicable in the EU, and as the resilience of Facebook and Google as too-big-to-fail monoliths illustrates, fines can only go so far. Only widespread consumer demand and paradigm shifts in corporate philosophy can truly mitigate the usage of unethically sourced data.

Tenets of Ethically Sourced Data

In the introduction, I was curious about how the data solutions company would use voice samples if I were to provide them with some. While an assurance of how the data would be used might have encouraged my trust in the study, the overall process of being financially compensated for providing data is ethical enough for the circumstance. The release was likely opaque because the information would simply be applied to a model and sold down the line. Still, it is a clear agreement in which a potential "supplier" can plainly opt in or opt out.

This sort of clarity is uncommon while traversing the internet. A user is engaged in multiple privacy and data policies at once, and those policies are sometimes updated without the acknowledgment and approval of the end-user. The ease of the web, combined with the size of the corporations we interact with online, leads to an undue amount of responsibility falling on the user. Corporations must shoulder that burden and maintain principles that foster consumer trust. Here are some examples:

  • Transparency in Data Collection: disclose what is collected, how it is collected, for what purpose, and for whom.
  • Collecting the Bare Minimum: only collect what is needed and what you have a use for; collecting as little as necessary also protects against liability.
  • No PII: avoid sharing Personally Identifiable Information.
  • Consent Must Be Clear: consent changes over time, so all changes to policies must be published and users notified; extrapolations of consent should not be made, as only expressed consent is viable.
  • Clarity Is Paramount: policies should be written clearly to ensure that consumers understand and consent to data policies.
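The "bare minimum" and "no PII" tenets above can be sketched in code. The snippet below is a minimal illustration, not a production implementation: all field names (and the idea of applying it to a voice-sample study's intake form) are hypothetical. The core idea is an explicit allow-list, so any field not deliberately chosen — including PII — is never stored at all.

```python
# Hypothetical intake filter for a voice-sample study: records are reduced
# to an explicit allow-list before storage, so unneeded fields and PII are
# dropped rather than collected and protected after the fact.

# The only fields the (hypothetical) study actually needs.
ALLOWED_FIELDS = {"sample_id", "dialect", "age_range", "consent_version"}

def minimize(record: dict) -> dict:
    """Keep only allow-listed fields from a raw submission."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw_submission = {
    "sample_id": "s-0042",
    "dialect": "AAVE",
    "age_range": "25-34",
    "consent_version": "2021-08",
    "email": "participant@example.com",  # PII: never reaches storage
    "device_model": "unknown",           # not needed: dropped
}

stored = minimize(raw_submission)
print(sorted(stored))  # only the four allow-listed keys survive
```

An allow-list is preferable to a deny-list of known PII fields: a deny-list silently keeps any new or unexpected field, while an allow-list fails safe by discarding it.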
