Tax Court of Canada Bulk Decisions Dataset

Description: This is a bulk open-access dataset containing the full text of Tax Court of Canada decisions, available in JSON, Parquet and Hugging Face Dataset formats. A description of how the data is processed, along with code snippets for loading the data, is available in a repository on the Refugee Law Lab GitHub.

Data: https://github.com/Refugee-Law-Lab/tcc_bulk_data/tree/master/DATA/YEARLY

GitHub Repository: https://github.com/Refugee-Law-Lab/tcc_bulk_data

Hugging Face Repository: https://huggingface.co/datasets/refugee-law-lab/canadian-legal-data

Current Coverage: 2003 – 2023 (* to July 24 in 2023)

Number of Decisions: ~15,000

Languages: English & French

Formats: JSON (yearly files), Parquet, Hugging Face Dataset

License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). NOTE: Users must also comply with upstream licensing for the TCC source data, as well as requests on source urls not to allow indexing of the documents by search engines to protect privacy. As a result, users must not make the data available in formats or locations that can be indexed by search engines.

Citation: Sean Rehaag, “Tax Court of Canada Bulk Decisions Dataset” (2023), online: Refugee Law Laboratory https://refugeelab.ca/bulk-data/tcc

Data Fields:

  • citation1 (string): Legal citation for the document (neutral citation where available)
  • citation2 (string): Second citation, where a document has more than one (e.g. for some periods the Supreme Court of Canada provided both an official reported citation and a neutral citation)
  • dataset (string): The name of the dataset (in this case “TCC”)
  • year (int32): Year of the document date, which can be useful for filtering
  • name (string): Name of the document, typically the style of cause of a case
  • language (string): Language of the document, “en” for English, “fr” for French, “” for no language specified
  • document_date (string): Date of the document, typically the date of a decision (yyyy-mm-dd)
  • source_url (string): URL where the document was scraped and where the official version can be found
  • scraped_timestamp (string): Date the document was scraped (yyyy-mm-dd)
  • unofficial_text (string): Full text of the document (unofficial version, for official version see source_url)
  • other (string): Field for additional metadata in JSON format, currently a blank string for most datasets
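
The field layout above can be exercised with a minimal sketch like the one below. It uses only the Python standard library and two made-up records (the names, URLs and text are placeholders, not real decisions), and the helpers `select`, `parse_date` and `parse_other` are hypothetical names introduced for illustration. It assumes `document_date` may occasionally be blank and that `other`, when not blank, holds a JSON object.

```python
import json
from datetime import date

# Hypothetical records that mirror the field layout; all values are placeholders.
sample_records = [
    {"citation1": "2020 TCC 1", "citation2": "", "dataset": "TCC", "year": 2020,
     "name": "Example v. The Queen", "language": "en",
     "document_date": "2020-01-15", "source_url": "https://example.org/en/1",
     "scraped_timestamp": "2023-07-24", "unofficial_text": "Sample text.",
     "other": ""},
    {"citation1": "2020 CCI 1", "citation2": "", "dataset": "TCC", "year": 2020,
     "name": "Exemple c. La Reine", "language": "fr",
     "document_date": "2020-01-15", "source_url": "https://example.org/fr/1",
     "scraped_timestamp": "2023-07-24", "unofficial_text": "Texte exemple.",
     "other": ""},
]

def select(records, language, start_year, end_year):
    """Keep records in one language whose year falls in an inclusive range."""
    return [r for r in records
            if r["language"] == language and start_year <= r["year"] <= end_year]

def parse_date(record):
    """Parse the yyyy-mm-dd string into a date; return None if it is blank."""
    value = record["document_date"]
    return date.fromisoformat(value) if value else None

def parse_other(record):
    """Decode the `other` field; a blank string becomes an empty dict."""
    return json.loads(record["other"]) if record["other"] else {}

english_2020 = select(sample_records, "en", 2020, 2020)
decision_dates = [parse_date(r) for r in english_2020]
```

The same filters translate directly to a pandas dataframe once the data is loaded, for example by comparing the language and year columns.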

Programmatic Access in Python (via Hugging Face Datasets):

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("refugee-law-lab/canadian-legal-data", split="train", data_dir="TCC")

# convert to dataframe
df = pd.DataFrame(dataset)
df

Programmatic Access in Python (via Parquet):

import pandas as pd
import requests
from io import BytesIO

url = 'https://huggingface.co/datasets/refugee-law-lab/canadian-legal-data/resolve/main/TCC/train.parquet'

# load data
results = requests.get(url)
results.raise_for_status()  # stop early if the download failed

# convert to dataframe
df = pd.read_parquet(BytesIO(results.content))
df

Programmatic Access in Python (via JSON):

import pandas as pd
import json
import requests

# Set variables
start_year = 2003  # First year of data sought (2003 +)
end_year = 2023  # Last year of data sought (2023 -)

# load data
base_url = 'https://raw.githubusercontent.com/Refugee-Law-Lab/tcc_bulk_data/master/DATA/YEARLY/'
results = []
for year in range(start_year, end_year+1):
    url = base_url + f'{year}.json'
    results.extend(requests.get(url).json())

# convert to dataframe
df = pd.DataFrame(results)
df
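
The yearly JSON loop above builds one URL per year. As a small sketch, the range can be validated before any request is made. The helper `yearly_urls` and the constants `FIRST_YEAR` and `LAST_YEAR` are hypothetical names introduced here; the bounds are assumed from the coverage stated on this page (2003 to 2023) and would need updating if coverage changes.

```python
# Base URL taken from the JSON example; the year bounds are assumptions
# based on the stated coverage of the dataset.
BASE_URL = ("https://raw.githubusercontent.com/Refugee-Law-Lab/"
            "tcc_bulk_data/master/DATA/YEARLY/")
FIRST_YEAR, LAST_YEAR = 2003, 2023

def yearly_urls(start_year, end_year):
    """Return one yearly JSON URL per year in the inclusive range."""
    if not (FIRST_YEAR <= start_year <= end_year <= LAST_YEAR):
        raise ValueError(
            f"years must satisfy {FIRST_YEAR} <= start <= end <= {LAST_YEAR}"
        )
    return [f"{BASE_URL}{year}.json" for year in range(start_year, end_year + 1)]

urls = yearly_urls(2021, 2023)
```

Each returned URL can then be passed to requests.get as in the loop above.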

NOTES

(1) Data Source: Tax Court of Canada.

(2) Unofficial Data: The data are unofficial reproductions of materials on the Tax Court of Canada website. Links to official versions are included in the dataset.

(3) Non-Affiliation / Endorsement: The data has been collected and reproduced without any affiliation or endorsement from the Tax Court of Canada.

(4) Non-Commercial Use: As indicated in the license, data may be used for non-commercial use (with attribution) only. For commercial use, see the Tax Court of Canada website’s Terms of Use.

(5) Accuracy: Data was collected and processed programmatically for the purposes of academic research. While we make best efforts to ensure accuracy, data gathering of this kind inevitably involves errors. As such, the data should be viewed as preliminary information intended to prompt further research and discussion, rather than as definitive information.

(6) Limitation: The dataset only includes cases with a neutral citation, which began to be used in 2003.

(7) Delay: Decisions may take many months to be translated (sometimes over a year). As a result, in the most recent years, decisions may only be available in one language.
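
One rough way to see the translation delay described in note (7) is to count records per year and language. The sketch below uses made-up records and assumes each language version of a decision is stored as its own record, which this page does not state explicitly. The helper `language_gap` is a hypothetical name.

```python
from collections import Counter

# Hypothetical records; only the two fields used here are shown.
records = [
    {"year": 2021, "language": "en"},
    {"year": 2021, "language": "fr"},
    {"year": 2023, "language": "en"},
    {"year": 2023, "language": "en"},
    {"year": 2023, "language": "fr"},
]

# Count records for every (year, language) pair.
counts = Counter((r["year"], r["language"]) for r in records)

def language_gap(counts, year):
    """Absolute difference between English and French record counts for a year."""
    return abs(counts[(year, "en")] - counts[(year, "fr")])

gap_2021 = language_gap(counts, 2021)
gap_2023 = language_gap(counts, 2023)
```

A larger gap in recent years would be consistent with translations that are still pending, though it could also reflect other differences in the source data.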