Description: This is a bulk open-access dataset in JSON, Parquet and Hugging Face dataset formats with the full text of Federal Court (Canada) decisions. The data-processing pipeline and code snippets for loading the data are available in a repository on the Refugee Law Lab GitHub.
Data: https://github.com/Refugee-Law-Lab/fc_bulk_data/tree/master/DATA/YEARLY
GitHub Repository: https://github.com/Refugee-Law-Lab/fc_bulk_data
Hugging Face Repository: https://huggingface.co/datasets/refugee-law-lab/canadian-legal-data
Current Coverage: 2001 – Present (* cases with neutral citation)
Number of Decisions: ~60,000
Languages: English & French
Format: JSON (yearly files for each language), Parquet, Hugging Face Dataset
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). NOTE: Users must also comply with upstream licensing from the FC data source, as well as with requests on the source URLs that the documents not be indexed by search engines, in order to protect privacy. As a result, users must not make the data available in formats or locations that can be indexed by search engines.
Citation: Sean Rehaag, “Federal Court Bulk Decisions Dataset” (2023), online: Refugee Law Laboratory https://refugeelab.ca/bulk-data/fc
Data Fields:
- citation1 (string): Legal citation for the document (neutral citation where available)
- citation2 (string): For some documents multiple citations are available (e.g. for some periods the Supreme Court of Canada provided both official reported citation and neutral citation)
- dataset (string): The name of the dataset (in this case “FC”)
- year (int32): Year of the document date, which can be useful for filtering
- name (string): Name of the document, typically the style of cause of a case
- language (string): Language of the document, “en” for English, “fr” for French, “” for no language specified
- document_date (string): Date of the document, typically the date of a decision (yyyy-mm-dd)
- source_url (string): URL from which the document was scraped and where the official version can be found
- scraped_timestamp (string): Date the document was scraped (yyyy-mm-dd)
- unofficial_text (string): Full text of the document (unofficial version, for official version see source_url)
- other (string): Field for additional metadata in JSON format, currently a blank string for most datasets
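As a rough illustration of working with these fields, the sketch below parses the date strings and the JSON-encoded `other` field of a single record, then filters records by language and year. The sample record is hypothetical (its values are invented, not taken from the dataset), and `normalize_record` and `select` are illustrative helper names rather than part of any published tooling. It assumes dates follow the documented yyyy-mm-dd format.

```python
import json
from datetime import date

# Hypothetical sample record using the documented field names;
# every value here is invented for illustration only.
sample = {
    "citation1": "2020 FC 9999",
    "citation2": "",
    "dataset": "FC",
    "year": 2020,
    "name": "Example v. Example",
    "language": "en",
    "document_date": "2020-01-06",
    "source_url": "https://example.org/decision",
    "scraped_timestamp": "2023-01-01",
    "unofficial_text": "Full text would appear here.",
    "other": "",
}


def normalize_record(record):
    """Return a copy with dates parsed and `other` decoded from JSON."""
    out = dict(record)
    # Dates are documented as yyyy-mm-dd strings; use None when blank.
    for field in ("document_date", "scraped_timestamp"):
        value = record.get(field) or ""
        out[field] = date.fromisoformat(value) if value else None
    # `other` holds JSON text and is a blank string when there is no metadata.
    out["other"] = json.loads(record["other"]) if record.get("other") else {}
    return out


def select(records, language, first_year, last_year):
    """Keep records in one language whose year is in the inclusive range."""
    return [
        r for r in records
        if r["language"] == language and first_year <= r["year"] <= last_year
    ]


normalized = normalize_record(sample)
matches = select([sample], "en", 2019, 2021)
```

Filtering on `year` and `language` before touching `unofficial_text` keeps memory use down, since the full text is by far the largest field.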
Programmatic Access in Python (via Hugging Face Datasets):
    from datasets import load_dataset
    import pandas as pd

    dataset = load_dataset("refugee-law-lab/canadian-legal-data", "FC", split="train")

    # convert to dataframe
    df = pd.DataFrame(dataset)
    df
Programmatic Access in Python (via Parquet):
    import pandas as pd
    import requests
    from io import BytesIO

    url = 'https://huggingface.co/datasets/refugee-law-lab/canadian-legal-data/resolve/main/FC/train.parquet'

    # load data
    results = requests.get(url)

    # convert to dataframe
    df = pd.read_parquet(BytesIO(results.content))
    df
Programmatic Access in Python (JSON):
    import pandas as pd
    import json
    import requests

    # Set variables
    start_year = 2001  # First year of data sought (2001 +)
    end_year = 2023  # Last year of data sought (2023 -)
    languages_sought = ['en', 'fr']  # languages in list
    base_url = 'https://raw.githubusercontent.com/Refugee-Law-Lab/fc_bulk_data/master/DATA/YEARLY/'

    # load data: one JSON file per year and language
    results = []
    for year in range(start_year, end_year+1):
        for language in languages_sought:
            url = base_url + f'{year}_{language}.json'
            results.extend(requests.get(url).json())

    # convert to dataframe
    df = pd.DataFrame(results)
    df
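The yearly JSON loop above stops with an error if any single file is unavailable. A more tolerant variant might look like the sketch below, which uses only the standard library. It follows the `{year}_{language}.json` naming pattern from the snippet above, and the helper names `build_yearly_urls` and `load_records` are hypothetical. It also assumes that a missing year or language file is served as HTTP 404 and can safely be skipped, which may not suit every use.

```python
import json
import urllib.error
import urllib.request

BASE_URL = (
    "https://raw.githubusercontent.com/Refugee-Law-Lab/"
    "fc_bulk_data/master/DATA/YEARLY/"
)


def build_yearly_urls(start_year, end_year, languages):
    """List yearly JSON URLs using the {year}_{language}.json pattern."""
    return [
        f"{BASE_URL}{year}_{language}.json"
        for year in range(start_year, end_year + 1)
        for language in languages
    ]


def load_records(urls, timeout=60):
    """Download each file; collect records and note any URLs returning 404."""
    records, missing = [], []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                # Each yearly file is expected to hold a JSON list of records.
                records.extend(json.load(response))
        except urllib.error.HTTPError as err:
            # Assumption: only "not found" is skippable; re-raise anything else.
            if err.code != 404:
                raise
            missing.append(url)
    return records, missing
```

Calling `load_records(build_yearly_urls(2001, 2023, ['en', 'fr']))` returns the combined records, which can be passed to `pd.DataFrame` as before, together with a list of any URLs that were skipped.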