Online Seminar: Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset (Peter Henderson)

August 23 @ 1:00 pm - 2:00 pm EDT

Speaker: Peter Henderson is a joint JD-PhD (Computer Science) candidate at Stanford University and Stanford Law School. He is also an Open Philanthropy AI Fellow, a Graduate Student Fellow at the Regulation, Evaluation, and Governance Lab, and a Technical Advisor at the Institute for Security + Technology.

Discussant: Simon Wallace is a PhD student at Osgoode Hall Law School, York University.

Abstract: One concern with the rise of large language models is their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and have failed to take context into account. In this seminar, Peter Henderson will discuss an approach to filtering grounded in law, which has long addressed the tradeoffs in filtering material. He will also present the Pile of Law, a 256GB dataset of open-source English-language legal and administrative data covering court opinions, contracts, administrative rules, and legislative records.

Paper: https://arxiv.org/abs/2207.00220

Data & model: https://huggingface.co/pile-of-law

