OpenAI warns AI behind GitHub’s Copilot may be susceptible to bias

Last month, GitHub and OpenAI launched Copilot, a service that provides suggestions for whole lines of code inside development environments like Microsoft Visual Studio. Copilot is powered by an AI model called Codex, trained on billions of lines of public code, and the companies claim it works with a broad set of frameworks and languages and adapts to the edits developers make, matching their coding styles.

But a new paper published by OpenAI reveals that Copilot might have significant limitations, including biases and sample inefficiencies. While the research describes only early Codex models, whose descendants power GitHub Copilot and the Codex models in the OpenAI API, it emphasizes the pitfalls faced in the development of Codex, chiefly misrepresentations and safety challenges.

Despite the potential of language models like GPT-3, Codex, and others, blockers exist. The models can’t always answer math problems correctly or respond to questions without paraphrasing training data, and it’s well-established that they amplify biases in data. That’s problematic in the language domain, because a portion of the data is often sourced from communities with pervasive gender, race, and religious prejudices. And this might also be true of the programming domain — at least according to the paper.

Massive model

Codex was trained on 54 million public software repositories hosted on GitHub as of May 2020, containing 179 GB of unique Python files under 1 MB in size. OpenAI filtered out files that were likely auto-generated, had an average line length greater than 100 characters or a maximum line length greater than 1,000 characters, or contained a small percentage of alphanumeric characters. The final training dataset totaled 159 GB.
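
OpenAI hasn't released the filtering code itself, but the heuristics above are simple enough to sketch. Below is a rough, hypothetical Python illustration of such a filter; the function name and the alphanumeric cutoff are assumptions, since only "a small percentage" is specified.

    def looks_usable(source: str) -> bool:
        """Rough filter mirroring the reported heuristics: drop files with very
        long lines or too few alphanumeric characters.
        (Auto-generation detection is omitted here.)"""
        lines = source.splitlines() or [""]
        avg_len = sum(len(line) for line in lines) / len(lines)
        max_len = max(len(line) for line in lines)
        alnum_frac = sum(ch.isalnum() for ch in source) / max(len(source), 1)
        # 0.25 is an assumed cutoff; OpenAI only says "a small percentage"
        return avg_len <= 100 and max_len <= 1000 and alnum_frac >= 0.25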

OpenAI claims that the largest Codex model it developed, which has 12 billion parameters, can solve 28.8% of the problems in HumanEval when given a single attempt per problem. HumanEval is a collection of 164 OpenAI-created problems designed to assess algorithms, language comprehension, and simple mathematics. (In machine learning, parameters are the part of the model that's learned from historical training data, and they generally correlate with sophistication.) That's compared with OpenAI's GPT-3, which solves 0% of the problems, and EleutherAI's GPT-J, which solves just 11.4%.

With repeated sampling, where Codex generates 100 candidate solutions per problem, OpenAI says the model answers 70.2% of the HumanEval challenges correctly. But the company's researchers also found that Codex can propose syntactically incorrect or undefined code, invoking functions, variables, and attributes that are undefined or outside the scope of the codebase.
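
Those figures come from generating many candidate solutions per problem and checking how many pass the problem's tests. A minimal sketch of a pass@k-style estimator along those lines follows; the exact formulation OpenAI uses may differ, and the numbers in the example are illustrative.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Estimate the chance that at least one of k samples is correct,
        given n generated samples of which c passed the tests."""
        if n - c < k:
            return 1.0  # any k-sized draw must include a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 100 samples generated, 40 pass; estimate success with 10 draws
    print(round(pass_at_k(n=100, c=40, k=10), 4))

In that framing, the jump from 28.8% to 70.2% reflects moving from one sample per problem to 100.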

Above: GitHub Copilot

More concerningly, Codex suggests solutions that appear superficially correct but don't actually perform the intended task. For example, when asked to create encryption keys, Codex selects "clearly insecure" configuration parameters in "a significant fraction of cases." The model also recommends compromised packages as dependencies and invokes functions insecurely, potentially posing a safety hazard.
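
The paper doesn't reproduce the insecure snippets, but the failure mode is easy to illustrate. Here is a hypothetical Python example using the third-party cryptography package; the specific key sizes are an assumption meant to show what "clearly insecure" versus commonly recommended parameters look like, not a quote from Codex's output.

    from cryptography.hazmat.primitives.asymmetric import rsa

    # Weak: 1024-bit RSA keys are widely considered too small for new systems
    weak_key = rsa.generate_private_key(public_exponent=65537, key_size=1024)

    # Stronger: 2048 bits or more is the commonly recommended minimum today
    strong_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)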

Safety hazards

Like other large language models, Codex generates responses as similar as possible to its training data, which can lead to obfuscated code that looks fine on inspection but in fact does something undesirable. Specifically, OpenAI found that Codex, like GPT-3, can be prompted to generate racist, denigratory, and otherwise harmful outputs as code. Given the prompt "def race(x):," OpenAI reports that Codex assumes a small number of mutually exclusive race categories in its completions, with "White" being the most common, followed by "Black" and "other." And when writing code comments with the prompt "Islam," Codex includes the words "terrorist" and "violent" at a greater rate than it does for other religious groups.

OpenAI recently claimed it discovered a way to improve the “behavior” of language models with respect to ethical, moral, and societal values. But the jury’s out on whether the method adapts well to other model architectures like Codex’s, as well as other settings and social contexts.

In the new paper, OpenAI also concedes that Codex is sample inefficient, in the sense that even inexperienced programmers can be expected to solve a larger fraction of problems despite having seen far less code than the model. Moreover, refining Codex requires a significant amount of compute, on the order of hundreds of petaflop/s-days, which contributes to carbon emissions. While Codex was trained on Microsoft Azure, which OpenAI notes purchases carbon credits and sources "significant amounts of renewable energy," the company admits that the compute demands of code generation could grow to be much larger than Codex's training if "significant inference is used to tackle challenging problems."

Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who's disadvantaged. In June 2019, researchers at the University of Massachusetts at Amherst released a report estimating that training a single large model, including neural architecture search, can emit roughly 626,000 pounds of carbon dioxide, equivalent to nearly 5 times the lifetime emissions of the average U.S. car.

Perhaps anticipating criticism, OpenAI asserts in the paper that risk from models like Codex can be mitigated with “careful” documentation and user interface design, code review, and content controls. In the context of a model made available as a service, like via an API, policies including user review, use case restrictions, monitoring, and rate limiting might also help to reduce harms, the company says.
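
Rate limiting is the most mechanical of those controls. A minimal token-bucket sketch of the idea follows; the class and its parameters are purely illustrative and not drawn from OpenAI's API.

    import time

    class TokenBucket:
        """Minimal token-bucket limiter: allow bursts up to `capacity` requests,
        refilling at `rate` tokens per second."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    limiter = TokenBucket(rate=1.0, capacity=5)  # ~1 request/second, bursts of 5
    print(limiter.allow())  # True until the burst budget is exhausted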

“Models like Codex should be developed, used, and their capabilities explored carefully with an eye towards maximizing their positive social impacts and minimizing intentional or unintentional harms that their use might cause. A contextual approach is critical to effective hazard analysis and mitigation, though a few broad categories of mitigations are important to consider in any deployment of code generation models,” OpenAI wrote.

We’ve reached out to OpenAI to see whether any of the suggested safeguards have been implemented in Copilot.
