This on_the_books_text_jc_all_v2_readme.txt file was generated on 20210929 by Amanda Henley

-------------------
GENERAL INFORMATION
-------------------

1. Title: Session Laws Passed by the North Carolina General Assembly during 1866/67-1967, Identified by Machine Learning as Laws Likely to be Jim Crow Laws (plain text format, single file) version 2

2. Author Information

   Principal Investigator Contact Information
      Name: Amanda Henley
      Institution: University of North Carolina at Chapel Hill University Libraries
      Email: amanda.henley@unc.edu

   Co-Principal Investigator Contact Information
      Name: Matt Jansen
      Institution: University of North Carolina at Chapel Hill University Libraries
      Email: mtjansen@email.unc.edu

   Creator Contact Information
      Name: Lorin Bruckner
      Institution: University of North Carolina at Chapel Hill University Libraries
      Email: lorin.bruckner@unc.edu

   Creator Contact Information
      Name: William Sturkey
      Institution: University of North Carolina at Chapel Hill University Libraries
      Email: wsturkey@live.unc.edu

   Creator Contact Information
      Name: Kimber Thomas
      Institution: University of North Carolina at Chapel Hill University Libraries
      Email: ksymone@live.unc.edu

   Additional Creators and researchers: Neil Byers, Sarah Carrier, Rucha Dalwadi, Grant Glass, and James Dick.

3. Date of data collection:
   Primary source materials dated 1866-1967.
   Images digitized and added to the Internet Archive 2009-2011.
   Images OCR'd and corpus created 2019-2021.

4. Geographic location of data collection: North Carolina

5. Information about funding sources that supported the collection of the data:
   Creation of this corpus was funded by the Andrew W. Mellon Foundation as part of the first cohort for Collections as Data: Part to Whole, the Association of Research Libraries' Venture Fund Award, and the University of North Carolina at Chapel Hill University Libraries' IDEA Action Fund.

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List.
   on_the_books_text_jc_all_v2.txt is a text file of 1,939 laws identified as likely Jim Crow laws.

2. Are there multiple versions of the dataset?
   Yes. This is the second release. Version 1 was released on August 31, 2020; version 2 was released on September 29, 2021.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection, generation, and processing of data:

Methods are fully explained in the project white paper: https://doi.org/10.17615/5c4g-sd44

All automated data processing was done using Python. Briefly, the methods used to create the corpus can be divided into these major stages:
   i. Data Acquisition
   ii. Marginalia Determination
   iii. Image Adjustment Recommendations
   iv. Optical Character Recognition (OCR)
   v. Section Splitting & Cleaning
   vi. Analysis
   vii. Text File Creation

Data Acquisition
During data acquisition, images and metadata were gathered through a combination of automated downloads from the Internet Archive and manual metadata creation.

Marginalia Determination
Marginalia (text printed in the page margins as a finding aid) appear in the corpus volumes printed prior to 1951. The marginalia are not part of the laws and needed to be left out of the OCR process, as did paratextual information from page headers and footers. The marginalia determination process identified the coordinates of the main text body for OCR. It also identified the median page color, which allowed a blank, color-neutral border to be created around the main body text on each page; Tesseract OCR performs best when the text is not too close to the edge of the page.

Image Adjustment Recommendations
Once the marginalia cropping information had been compiled, various image adjustments were tested for each volume to maximize OCR performance.
A sample of images for each volume was selected and tested using different values for a range of parameters (color, contrast, etc.). Once the optimal image adjustments for each volume had been determined, they were stored for use during the following OCR stage.

Optical Character Recognition (OCR)
Images were pre-processed according to the image adjustment recommendations using Pillow 7.0.0, a fork of the Python Imaging Library. Adjustments included removing the skew on text, changing parameters such as color and contrast, and removing marginalia and paratextual information. OCR was performed on each page of each volume to produce a series of output files. This step was accomplished using Tesseract OCR, which was accessed programmatically via the pytesseract wrapper. Tesseract v4.1.0.20190314 was used for all volumes except the 1913 extra session, which was processed with Tesseract v5.0.0-alpha.

Section Splitting & Cleaning
After OCR was complete, each volume was 'split' into its constituent chapters and sections, with each section representing an individual law. This was accomplished using regular expression pattern matching. Once initial assignments had been made, the corpus underwent a lengthy cleaning process that eliminated most section and chapter assignment errors. For version 1, the corpus of all session laws contained 53,515 chapters and 297,790 sections. Initially, 27,327 chapter/section split errors were identified. Of these, 89.7% were corrected for version 1 of the corpus. Additional work was done to correct the remaining known errors during phase 2 of the project, reflected in version 2.

Analysis
Jim Crow laws were identified using supervised classification. A training set was compiled from pre-existing work by Pauli Murray[1] and Richard Paschal[2] and expanded by expert reviewers doing close reading. A combination of preliminary classifications and expert review was used to expand the existing labeled training set.
The resulting expanded training set was used to classify the entire corpus as "Jim Crow" or "not Jim Crow". This step was accomplished using scikit-learn and XGBoost to build and evaluate models. For text processing, NLTK was used. All laws identified as Jim Crow by the model were reviewed by an attorney on the project team, and only the laws they confirmed to be likely Jim Crow laws were included in the final output. The reviewer identified 326 false positives out of 1,716 laws identified by the algorithm. The false positives were overwhelmingly cases in which the words "race", "white", and "colored" were used in other contexts. The most common of these (~150) were miscategorizations of the word "white".

[1] Murray, Pauli. 1951. States’ Laws on Race and Color: And Appendices Containing International Documents, Federal Laws and Regulations, Local Ordinances and Charts. Cincinnati: Woman’s Division of Christian Service, Board of Missions and Church Extension, Methodist Church.
[2] Paschal, Richard. 2020. Jim Crow in North Carolina: The Legislative Program from 1865 to 1920. Durham: Carolina Academic Press.

Text File Creation
Following analysis, the corpus was prepared for dissemination from the Carolina Digital Repository. All of the laws identified as likely Jim Crow laws were added to a single text file.

DIFFERENCES BETWEEN VERSIONS 1 & 2:
For version 2, the Section Splitting & Cleaning stage was extended to address all known splitting errors.
The laws identified as likely Jim Crow laws for version 2 were all validated by an attorney on the project team.
The training set used for version 2 was larger than that used for version 1. It included laws from version 1 that were identified by the algorithm and confirmed by an expert on the project team.

2.
Describe any quality-assurance procedures performed on the data:

Quality assessment is fully explained in the project white paper: https://doi.org/10.17615/5c4g-sd44

Briefly, OCR quality was assessed at both the page level and the word level. During the quality assessment and processing steps, the project team identified several types of errors:
   • Some areas of text were skipped during OCR, resulting in gaps in the text.
   • Some areas of text were erroneously excluded due to incorrect marginalia determination.
   • Some words were not OCR’d correctly because they were not delineated correctly by Tesseract.
   • Some words were delineated correctly, but not OCR’d correctly.
   • Pages that show text in tabular format (tables) did not OCR well.

Based on our assessment, the words in the corpus were OCR’d correctly 83.76% of the time, and we estimate that at least 94% of the pages do not have significant OCR errors.

When splitting the text into chapters and sections, the team found that numbers were frequently OCR’d incorrectly, especially 3’s and 8’s. Some sections of text were skipped by OCR, either because of errors associated with header and marginalia removal, or because the text was not readable by the software. Missing areas of text that were identified during the chapter splitting process were transcribed by hand (which is subject to error).

The Jim Crow corpus contains 1,939 laws. It is not a comprehensive compilation of all Jim Crow laws enacted during the period of study. All laws identified as Jim Crow by the model were reviewed by an attorney on the project team, and only the laws they confirmed to be likely Jim Crow laws were included in the final output.

3.
People involved with sample collection, processing, analysis and/or submission:
   The following Library student workers assisted with processing and preparing the corpus: Montana Eck, Julia Long, Ashley Mullikin, Siri Nallaparaju, Tim Oyeleke, and Jenna Patton

-----------------------------------------
DATA-SPECIFIC INFORMATION
-----------------------------------------

1. Number of variables:
   For each law, the volume date, law type, chapter, section, and identification source are provided.
   The "Identified by:" field records how each law was identified, as either "expert" or "model". Laws were identified either by an expert or by the machine learning model, and each was verified as a likely Jim Crow law by a member of the project team.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data:
   Creative Commons Attribution-NonCommercial 3.0: https://creativecommons.org/licenses/by-nc/3.0/

2. Was data derived from another source? Yes
   If yes, list source(s):
   The data was created by performing OCR on images from the Internet Archive. The collection used for this project was digitized between 2009 and 2011 under the IMLS grant Ensuring Democracy through Digital Access, a partnership between East Carolina University, the State Library of North Carolina, and the University Libraries at the University of North Carolina at Chapel Hill.

   Some laws included in the Jim Crow corpus were identified from the following sources:
   Murray, Pauli. 1951. States’ Laws on Race and Color: And Appendices Containing International Documents, Federal Laws and Regulations, Local Ordinances and Charts. Cincinnati: Woman’s Division of Christian Service, Board of Missions and Church Extension, Methodist Church.
   Paschal, Richard. 2020. Jim Crow in North Carolina: The Legislative Program from 1865 to 1920. Durham: Carolina Academic Press.

3.
Recommended citation for the data:
   University Libraries, University of North Carolina at Chapel Hill. Session Laws Passed by the North Carolina General Assembly during 1866/67-1967, Identified by Machine Learning as Laws Likely to be Jim Crow Laws (plain text format, single file) version 2, from On the Books: Jim Crow and Algorithms of Resistance. 2021. https://doi.org/10.17615/5c4g-sd44 (date accessed).

This readme was adapted from a template by the Cornell University Research Data Management Service Group.
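
Example: tallying the "Identified by:" field
The DATA-SPECIFIC INFORMATION section states that each law carries an "Identified by:" value of either "expert" or "model". The Python sketch below shows one possible way to count those values. It is an illustration only: it assumes that the label appears literally in the text followed by its value, and the function name, the regular expression, and the two-law sample string are hypothetical. The actual layout of on_the_books_text_jc_all_v2.txt may differ and should be checked before relying on the counts.

```python
import re
from collections import Counter

# Assumption: each law's metadata contains the literal label "Identified by:"
# followed by "expert" or "model", as described under DATA-SPECIFIC INFORMATION.
IDENTIFIED_BY = re.compile(r"Identified by:\s*(expert|model)", re.IGNORECASE)


def count_identification_sources(text):
    """Count how many laws are marked as identified by an expert or by the model."""
    return Counter(value.lower() for value in IDENTIFIED_BY.findall(text))


# Hypothetical two-law excerpt; the real file layout may differ.
sample = (
    "Identified by: expert\n"
    "Sec. 1. (text of a law)\n"
    "Identified by: model\n"
    "Sec. 2. (text of another law)\n"
)
counts = count_identification_sources(sample)
print(counts["expert"], counts["model"])
```

To apply the same function to the released file, its contents could be read with open("on_the_books_text_jc_all_v2.txt", encoding="utf-8").read() and passed in place of the sample; the UTF-8 encoding is also an assumption.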