Andrew W. Mellon Foundation
Grant 1711-05155
December 19, 2019
John Kiplinger
Valerie Yaw

The Impact of Open Access Latin American Scholarship: Digitizing the Backlist of El Colegio de México's Press

WHITE PAPER

In 2018, JSTOR received a grant from the Andrew W. Mellon Foundation to support the digitization of out-of-print titles from the Dirección de Publicaciones de El Colegio de México, A.C., as well as the dissemination of those titles on an openly accessible basis. Throughout the year-and-a-half-long project, we worked in deep collaboration with El Colegio de México Press to complete this project. This white paper is intended to document the significance of this work, the process we used to select titles, and what we have learned so far about the usage of these titles on the JSTOR platform. We hope this will help to benefit other initiatives interested in increasing access to out-of-print materials.

Copyright 2019 ITHAKA. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of the license, please see http://creativecommons.org/licenses/by-nc/4.0/.

ITHAKA is interested in disseminating this paper as widely as possible. Please contact us with any questions about using the report at support@jstor.org.

This project was made possible by The Andrew W. Mellon Foundation. Any views or recommendations expressed in this paper do not necessarily represent those of The Andrew W. Mellon Foundation.

The Dirección de Publicaciones de El Colegio de México, A.C. was established in 1938. It offers a catalog of more than 2,400 titles and nine academic journals across the humanities and social sciences.

JSTOR, a service of the not-for-profit organization ITHAKA, collaborates with the academic community to help libraries connect students and faculty to vital content while lowering costs and increasing shelf space; provides independent researchers with free and low-cost access to scholarship; and helps publishers reach new audiences and preserve their content for future generations.

JSTOR gratefully acknowledges the contributions and cooperation of the following:

  1. Gabriela Said Reyes, Director, Dirección de Publicaciones de El Colegio de México, A.C.
  2. Ninel Salcedo Romero, former Director of Marketing, Dirección de Publicaciones de El Colegio de México, A.C.
  3. Brian Connaughton, Área de Historia Regional y Comparada, Departamento de Filosofia, Universidad Autónoma Metropolitana
  4. Robert Darnton, Carl H. Pforzheimer University Professor and University Librarian, Emeritus, Harvard University
  5. Gilbert Joseph, Farnam Professor of History & International Studies, Yale University; Past President, Latin American Studies Association
  6. Herbert S. Klein, Gouverneur Morris Professor Emeritus of History, Columbia University; former Director of the Center for Latin American Studies and Professor of History at Stanford University; Research Scholar & Latin American Curator, Hoover Institution, Stanford University
  7. Jocelyn Olcott, Associate Professor, History and Gender, Sexuality & Feminist Studies, Duke University
  8. William B. Taylor, Muriel McKevitt Sonne Professor of Latin American History, Emeritus, University of California, Berkeley
  9. Pardha Karamsetty, President, Content & Media Solutions, Apex CoVantage; CEO, Apex CoVantage India
  10. Prabhanjan Mattam, Project Manager, Apex CoVantage

Summary

In 2018, JSTOR received a grant from the Andrew W. Mellon Foundation to support a collaboration with the Dirección de Publicaciones de El Colegio de México, A.C., the press of El Colegio de México, a graduate research institution in Mexico City^1. This grant enabled JSTOR to digitize nearly 700 books from the press's backlist in the humanities and humanistic social sciences, and make these books freely and openly available on the JSTOR online platform.

The goal of this project was to digitize and make openly accessible scholarship from the backlist of El Colegio de México's Press that would be of significant value to students and researchers in a range of humanities disciplines.

The work on this project proceeded in three phases, including a preparation and selection process, in which JSTOR worked with experts in the field to determine which books would be digitized; a digitization and ingest phase resulting in the books being hosted openly on JSTOR; and an analysis phase, in which JSTOR sought to develop a better understanding of the impact that foreign-language materials can have when hosted on a global platform.

This project brought together Colmex's rich scholarly backlist with JSTOR's experience managing retrospective digitization projects and helping to increase the impact of academic content by making that content easy to find and use online. Colmex and JSTOR have collaborated over the past several years to make Colmex's frontlist books available to readers around the world through JSTOR.org. In this project, we sought to build on that collaboration by making a selection of books from the Press's backlist available in digital form for the first time. In this white paper, we document our process for selection and digitization of books and provide a high-level analysis of usage of the content on the JSTOR platform.

Introduction: History, Context, and Significance of the Collection

The press of El Colegio de México has published a body of important scholarship over the course of the last eight decades.

(^1) Throughout this paper, we generally refer to Dirección de Publicaciones de El Colegio de México, A.C. simply as El Colegio de México or by its common name “Colmex.”

The press was established in 1938 in Mexico City. It attracted a group of pathbreaking scholars in the humanities and social sciences, and Colmex's press, one of the earliest scholarly publishers in Latin America, provided an outlet for their work, which foregrounded some of the ongoing lines of inquiry in Mexican and Latin American studies, including scholarship on migration to and from Mexico, the interplay between church and state in Latin America, and women's rights.

The university's press published its first title in 1938 and continued to publish significant work throughout its history. The list of the press spans disciplines in the humanities and qualitative social sciences, with special emphases on history, sociology, literary criticism, and political science. For the most part, the books focus on Mexican and Latin American contexts.

In addition to a robust books program, the press of El Colegio de México publishes seven journals, including Historia Mexicana, arguably the leading journal of Mexican historical studies. Over time, the press has also been an important outlet for making foreign-language writing available in Mexico: as one example, its journal Diálogos was the first to publish Milan Kundera's work in Spanish for a Mexican audience.

Since 2013, Colmex's press has published some of its new books in digital form and distributed them through digital scholarly platforms, including JSTOR. Like many established scholarly presses, Colmex licenses access to its frontlist titles to university libraries to help sustain its ongoing publishing program. However, much of Colmex's backlist was out of print and the press had never digitized it due to limited funding. In today's increasingly digital landscape, the lack of electronic copies of this important body of scholarship created, in essence, a barrier to accessing those titles.

This project sought to overcome this barrier and make these books discoverable and accessible for free by a worldwide audience. As noted in the Summary, El Colegio de México and JSTOR have collaborated over the past several years to make Colmex's frontlist books available to readers around the world through JSTOR. In this project, we built on that collaboration, bringing together Colmex's rich scholarly backlist with JSTOR's experience managing retrospective digitization projects and helping to increase the usage of academic content by making that content easy to find and use online. JSTOR has seen high usage and impact for both archival journals and for backlist monographs; in fact, two thirds of ebook usage on JSTOR is for titles published at least three years earlier.

Our Approach: Selection and Digitization

JSTOR digitized nearly 700 titles, or almost 50% of the press's backlist. Significantly, none of Colmex's backlist titles were previously available digitally. For every book made available through this project, each page was scanned and OCR processed, and accompanying book- and chapter-level metadata was captured to make the books fully searchable, discoverable, and usable for scholars and teachers.

Selection

We asked a group of scholar-advisors to help us assess the broader significance of Colmex's list in Mexican and Latin American Studies by drawing our attention to books that are noteworthy and that should be highlighted in outreach about the project to scholars, librarians, students, and general readers.

Our scholar-advisors assisted with the selection process mainly in two ways. First, they gave us high-level guidance to inform our strategic sense of the collection's value. One advisor wrote to us that the press's list “[provides] studies of the economic, social, demographic, and political history of Mexico unparalleled by any other publisher.”

Several of the scholars also noted the broad discipline coverage of Colmex's list; while we expected that the bulk of the books would be of greatest interest to historians, another advisor wrote to us that “[s]ociologists, economists, demographers, linguists and students of literature, geographers, and historians will all benefit by achieving the digital availability of these works.” It is worth noting, as some of our advisors did, that the Press also has a strong list in Asian studies, and the set of titles that we digitized through this project includes books from that area. While the inclusion of these titles may initially seem like an odd fit for a project that focuses for the most part on Mexican and Latin American studies titles, the press's list in Asian studies reflects a critical aspect of the Mexican academy's global engagement. Colmex's Center of Asian and African Studies is, as one adviser noted, “the only functioning center on Asian studies in Latin America,” and Colmex's press, picking up on this strength, has become “the major publisher of studies of Asian history in Spanish.” To the extent that this digitization project is meant in part to reflect the strengths and disciplinary breadth of Colmex's backlist, it seemed important to include these titles in the project.

Second, while acknowledging the overall value of Colmex's backlist, our advisors also directed us to particular titles that have become classics in their field. For example, some of these titles include Silvio Zavala's multi-volume El servicio personal de los indios en la Nueva España, a study of labor and slavery in the 16th to 18th centuries; books and edited volumes by Andrés Lira on Spanish exiles in Latin America after the Spanish Civil War; and Los bienes de la Iglesia en México, a study of the conflict between church and state in the 1800s.

Of particular note among the books we digitized is the Historia general de México, a multi-volume work completed in the 1970s and edited by the Colmex historian Daniel Cosío Villegas. This work covers the range of Mexico's history from the dawn of human habitation. As one reviewer in a scholarly journal noted, Cosío Villegas had a longstanding interest in reaching non-academic audiences, and so the scholars who penned essays for the Historia general were asked to write such that a general audience could read the work. Thus, one project advisor wrote, the volumes are well suited to “students at the high school and university level as well as to adult readers who give them the time and attention they deserve.” Despite the essays being shaped for a non-academic audience, one of our advisors noted that the Historia general remains “the standard general history [of Mexico] used by all scholars.”

With this guidance in mind, the list of books we digitized resulted from a winnowing process, the stages of which are outlined below^1:

(1) At the start of this project, Colmex had the necessary permissions to digitize and make freely available in digital form a significant number of titles in their backlist, in many cases because the author was a faculty member at Colmex. Given the sizable expense involved in clearing digital rights, we determined that there was significant value in focusing our efforts on books that did not require painstaking rights research. Of the 1,411 titles in the backlist, Colmex's press has distribution rights for 741.
(2) This list was then refined to exclude a small number of books that were not scholarly in nature (e.g., technical guides from the 1990s). We retained in the list, however, a small number of literary or primary source titles that would be useful for research and teaching.
(3) The list was further refined to exclude titles that did not fit well with the humanities and humanistic social sciences profile^2. For example, books that focused on environmental policy were considered out of scope for this project.
(4) Finally, based on cost estimates, we initially aimed to reach a final list of approximately 600 titles. Given cost constraints, we made difficult choices, including moving approximately 40 social science-leaning titles (many in political science) to a B-list. It is important to note that, while these titles were not included in the starting list for digitization, lower-than-anticipated costs allowed us to include these titles in our final output. We acknowledge that this initial selection process was not perfect, but we are pleased with the final outcome since these books hold value for humanities researchers (especially historians).

(^1) It is important to note that the winnowing process was undertaken by the project team with guidance from a set of scholar-advisors for the project, given that it was not feasible to ask these advisors, who are also full-time faculty members, to engage in a title-by-title selection process for a list of this size.

(^2) This project was funded through the Mellon Foundation's Humanities Open Book Program, which emphasizes out-of-print humanities books.

At the end of our initial selection process, we had an A-list of 611 titles. While the vast majority of the books on this list were in history, literature, or other humanities fields, there were also a number of exceptions. Some titles on the list leaned more toward the social sciences, including a number of books on public policy. We felt that it would be appropriate to include them because they would be of interest to scholars of Mexican and Latin American history. In addition, a handful of titles on the list (fewer than ten) are literary or primary texts (for example, a Spanish-language translation of Giambattista Vico's Scienza nuova).

Production

JSTOR's production unit converts over 9 million pages of scholarly journal and book content per year, of which 2 million includes scanning from print sources. We have longstanding relationships with several digitization vendors, and we believed that our experience managing large-scale digitization projects would position us well to accomplish the digitization of Colmex's backlist books quickly, cost-efficiently, and to a high quality.

For books, JSTOR normally receives and processes PDFs from publishers. These PDFs go through automated workflows at JSTOR's end as well as processing by a third-party vendor. This project was different because the source document for each book was a print version^1, and one of the required outputs was an ePub for each book. JSTOR selected one of our current conversion vendors, Apex CoVantage, to handle all vendor processing for the books in the project. This included scanning of the print copy, return shipment of the print copy to Colmex, creation of the PDF from the page images, OCR for creation of searchable full text, metadata capture to JSTOR standard specification for books, and then creation of an ePub. JSTOR negotiated a per-page price of $0.83 that covered all these tasks. The project covers 684 books.

Colmex sent nine shipments of print books to Apex CoVantage's production facility in Hyderabad, India. The initial batch was shipped in mid-April 2018, and the final batch was shipped in early May 2019. Each shipment contained an average of 76 print books. Apex conducted non-destructive scanning, capturing each page as a 600 dpi bitonal TIFF and scanning grayscale and color content at 300 dpi as RGB TIFF images. We instituted a discrepancy process wherein Apex reported damaged, missing, or other problematic pages to JSTOR. JSTOR assessed these reports and, as needed, worked with Colmex, Harvard University Library, and University of Michigan Library collections to locate and scan replacement pages from other extant copies. The resulting page scans were then used in place of the damaged or otherwise unusable pages or to fill gaps where there were missing pages so that the PDF would represent a complete and intact version of the print original.

(^1) Although JSTOR has scanned a relatively small number of print books outside this project, the bulk of our print scanning continues to be for journals. However, the same imaging specifications are used regardless of whether they are journal or book pages.

Apex submitted the completed PDFs to JSTOR and shipped the print copies back to Colmex. JSTOR's systems then ingested the PDF as well as spreadsheet-based supply chain metadata (SCM) provided separately by Colmex. The PDF and SCM were matched by the system and then were automatically sent to Apex for standard processing, which consists of OCR as well as book- and chapter-level metadata capture.

As Apex completed the standard processing for each book, they then put the books through an ePub creation process that, while very familiar to Apex, was new to JSTOR. The ePubs were created to the EPUB standard version 3.0.1 or higher. Additionally, the processing agreed upon between Apex and JSTOR ensured functionality such as links from footnote anchors in the text block to the footnotes themselves. However, features such as tables were captured as images rather than as HTML. During both the standard processing and ePub creation, Apex occasionally raised metadata capture queries that were reviewed and resolved by JSTOR's metadata librarian team of Karen Aufdemberge, Emily Betwee, and Rachel Ross, thus ensuring a higher and more consistent quality for the metadata.

Apex grouped the ePub, the PDF, and the book- and chapter-level metadata XML files into a zip file for delivery to JSTOR. JSTOR systems then ingested the zip file and ran quality control scripts across the files to ensure that they adhered to our specifications. For the initial batch of books, we also conducted a limited amount of manual quality control reviews of the metadata and of the ePub. To accommodate the ingest of these zip files, however, our content management systems staff had to update the JSTOR software to recognize and accept the different directory structure and files that were present (i.e., the directory containing the ePub as well as the ePub file itself) but had not been present in previous book deliverables from our vendors.
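The paper does not describe the quality control scripts themselves, so the following is only a minimal sketch of the kind of structural check such a script might start with: confirming that a delivered zip file contains at least one ePub, one PDF, and one metadata XML file. The required extensions, the function names, and the sample file names are assumptions made for illustration, not JSTOR's actual specification.

```python
import io
import zipfile

# Hypothetical required file types for one book deliverable; the actual
# JSTOR specification is not given in this paper.
REQUIRED_EXTENSIONS = (".epub", ".pdf", ".xml")


def missing_deliverables(zip_bytes: bytes) -> list[str]:
    """Return the required extensions that no file in the zip provides."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = [name.lower() for name in archive.namelist()]
    return [
        ext for ext in REQUIRED_EXTENSIONS
        if not any(name.endswith(ext) for name in names)
    ]


def build_sample_zip(file_names: list[str]) -> bytes:
    """Build an in-memory zip of empty placeholder files (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for file_name in file_names:
            archive.writestr(file_name, b"")
    return buffer.getvalue()


if __name__ == "__main__":
    # Illustrative file names; the real directory layout is not documented here.
    sample = build_sample_zip(["book/book.pdf", "book/book.xml"])
    print(missing_deliverables(sample))
```

A check like this only verifies that the expected file types are present; validating the ePub, PDF, and metadata contents against a specification would require further steps that the paper does not detail.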

Furthermore, downstream systems for our content delivery platform had to be updated to recognize and appropriately route and make available the ePub file. JSTOR opted to treat the ePub in a manner similar to that of supplementary materials. The ePub is available as a downloadable file via a clickable “Download EPUB” button at the top of the page for each book in the project. Otherwise, the book is treated in a similar manner to any other Open Access title on JSTOR.

Of the 684 books in the project, the first titles became available on the JSTOR site on September 11, 2018. The most recent releases were on July 19, 2019. There are currently four books for which processing cannot be completed because the books do not have ISBN assignments, and the JSTOR systems require an ISBN.

While this project had typical logistical challenges, the challenges that were new to JSTOR were:

  1. The need to send the books to one particular vendor instead of dividing them equally between our two vendors, which was addressed earlier in these comments; and
  2. The lack of electronic version ISBN (EISBN) assignments for any of the books.

ISBN best practice indicates that an electronic version of a book should have an ISBN that is distinct from its print version counterpart. In fact, different electronic versions (e.g., PDF vs. EPUB) can have their own ISBN assignments. However, JSTOR opted to use a single ISBN assignment to cover both electronic versions of each book. Going into the project, Colmex did not have EISBN assignments for the books, and 102 of the books did not have a print version ISBN (PISBN) either. One problem was that, for Mexican-published works, ISBNs are assigned by a third-party agency, and the turnaround times for the assignments, particularly for large batches of requests, are unpredictable. Therefore, so as not to jeopardize the overall timeframe for the project, JSTOR opted to use the PISBN for any book that had a PISBN assignment. This would allow us to ingest the supply chain metadata into JSTOR systems and to keep individual book processing moving beyond the print-scanning stage.

Meanwhile, Colmex would apply for EISBN assignments, attempting to prioritize the assignments for those books that had no ISBN assignment at all. For books that had no ISBN assignment at all, we could move them back into production post-scanning once we had the EISBN assignment. For books that had a PISBN assignment, we plan to do a mass swap of the PISBN for the EISBN once we have all those assignments. At the time this paper was written, 128 books were still awaiting an EISBN assignment, including four books that have no ISBN assignment at all and that therefore cannot proceed beyond the scanning stage.
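The identifier rule described in the preceding paragraphs can be summarized as a small decision function. This is a sketch of one reading of that rule, not JSTOR's implementation: the function names and the placeholder identifier strings are hypothetical, and preferring an EISBN whenever one already exists is an assumption consistent with, but not spelled out in, the text.

```python
from typing import Optional


def choose_ingest_isbn(eisbn: Optional[str], pisbn: Optional[str]) -> Optional[str]:
    """Pick the identifier to ingest a book under, or None to hold the book.

    Assumed rule: use the EISBN if it has been assigned, otherwise fall back
    to the PISBN as a placeholder, otherwise hold the book after scanning.
    """
    if eisbn:
        return eisbn
    if pisbn:
        return pisbn  # placeholder, to be swapped for the EISBN later
    return None


def needs_isbn_swap(eisbn: Optional[str], pisbn: Optional[str]) -> bool:
    """True when a book was ingested under its PISBN and still awaits an EISBN."""
    return not eisbn and bool(pisbn)


if __name__ == "__main__":
    # Placeholder strings, not real ISBNs.
    print(choose_ingest_isbn(None, "PISBN-EXAMPLE"))  # falls back to the PISBN
    print(choose_ingest_isbn(None, None))             # None: book is held
```

Under this reading, the four books with no ISBN at all correspond to the `None` case, and the books ingested under a PISBN are the ones flagged by `needs_isbn_swap`.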

We are currently planning a project to swap the PISBN for the EISBN for those books where we have the EISBN assignments. We will finish the processing and/or ISBN swaps for the remaining books when the EISBN assignments are available. If we were to do a similar digitization project for backlist books, we would certainly investigate the EISBN situation at the earliest possible stage and work with project partners to secure EISBN assignments as soon as possible.

Usage: What We've Learned So Far

This project represented not only an opportunity to digitize and make available books from the publication run of Colmex's list, but also to measure the usage of these books over time and, ultimately, to understand better the impact that foreign-language materials can have when hosted on a globally accessed platform.

Our objective in measuring usage was to understand how frequently the Colmex books are read online as evidenced by generally accepted metrics such as views and downloads of the chapter files. Additionally, we wanted to understand how this usage compares with the usage of approximately 4,500 openly accessible English-language books hosted on JSTOR.

JSTOR facilitates the discovery of ebook content in a variety of ways. We offer free MARC records to libraries through OCLC, and distribute metadata and full text to discovery services and search engines for indexing. Another important factor in driving usage is co-locating ebook chapters with journal articles on JSTOR's integrated platform, enabling users to cross-search all types of content at once. For many scholars, JSTOR is a starting point for research; in fact, our traffic referral data shows that more than 40% of visits to ebook pages are by users who were already searching and using JSTOR. Faculty and students are incorporating ebooks into their established research workflows on the platform. In addition, we promoted the availability of the Colmex titles via a short animated video in English and Spanish, email campaigns to librarians and faculty in Latin American studies, announcements shared via JSTOR and Colmex's web and social media channels, and promotions to members of the Latin American Studies Association, including advertisements and a presentation at the association's annual conference.

The Colmex titles digitized through this project have been heavily used on JSTOR. The 680 titles made available on JSTOR between September 2018 and July 2019 have been used a total of 502,134 times through October 28, 2019. Every single title has been used. The most-used titles are listed below.

Top ten most-used titles

| Title | Copyright year | Usage through 10/28/2019 |
| --- | --- | --- |
| Historia económica general de México: de la colonia a nuestros días | 2010 | 13,251 |
| Historia general de México: volumen I | 1994 | 9,323 |
| Los intelectuales y el poder en México | 1991 | 5,156 |
| De amicitia et doctrina: homenaje a Martha Elena Venier | 2007 | 4,785 |
| La lingüística en México, 1980-1996 | 1998 | 4,564 |
| Diccionario del español usual en México | 1996 | 4,503 |
| Introducción a la historia de la vida cotidiana | 2006 | 4,300 |
| Historia de la lectura en México | 1997 | 3,888 |
| Cuestiones de teoría sociológica | 2005 | 3,659 |
| Historia general de México: volumen II | 1994 | 3,595 |

The data show that there is a broad audience for this scholarship. The titles have been used in 173 countries and territories. While high levels of usage were recorded in Spanish-speaking countries, as we expected, usage also occurred in 161 countries and territories where Spanish is not a national or official language. The map below shows the countries in which we have recorded usage for the Colmex titles, and the table lists the ten countries with the highest usage.

Top ten countries that recorded the most usage

| Country | Usage through 10/28/2019 |
| --- | --- |
| Mexico | 151, |
| United States | 54, |
| Colombia | 29, |
| Spain | 17, |
| Argentina | 13, |
| Peru | 11, |
| Chile | 9, |
| Ecuador | 9,143 |
| Costa Rica | 4,770 |
| United Kingdom | 3,580 |

Because JSTOR works with thousands of institutions around the world, we can measure the usage of these titles at institutions that participate in our services. We recorded usage of the Colmex titles at 4,285 institutions. This included not only colleges and universities, but also community colleges, secondary schools, government and not-for-profit organizations, and public libraries.

JSTOR's ebook program had not previously hosted EPUB files; for this project, we added the capability for users to download the full book as an EPUB file from the table of contents page, as well as the standard option to view or download chapter-level PDFs. There were 19,234 downloads of EPUB files for the Colmex titles through the end of October 2019, just 3.8% of the total usage of the titles in that timeframe.
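The 3.8% figure can be reproduced from the totals reported in this section. The short calculation below uses only numbers stated in the paper; the average usage per title is a derived figure that the paper itself does not report, and the variable names are illustrative.

```python
# Figures reported in this paper.
TOTAL_USAGE = 502_134     # views and downloads through October 28, 2019
TITLES_AVAILABLE = 680    # titles live on JSTOR in that period
EPUB_DOWNLOADS = 19_234   # full-book EPUB downloads

# Share of all usage that came from EPUB downloads.
epub_share_pct = 100 * EPUB_DOWNLOADS / TOTAL_USAGE

# Derived here for context; not a figure stated in the paper.
average_usage_per_title = TOTAL_USAGE / TITLES_AVAILABLE

print(f"EPUB share of usage: {epub_share_pct:.1f}%")               # 3.8%
print(f"Average usage per title: {average_usage_per_title:.0f}")   # 738
```

The average hides a skewed distribution: the most-used title alone accounts for 13,251 uses, far above the mean of roughly 738.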

This project also gave us the opportunity to compare the usage of Spanish- and English-language titles available on JSTOR. On average, the Colmex ebooks are used 57% as much as the Open Access titles in English on the platform. While there are other variables that may affect the level of usage (such as discipline or copyright year), this figure shows an impressive amount of usage of Spanish-language titles on a primarily English-language scholarly content site.

We've also received positive feedback from librarians and scholars regarding the access to this content. For example, responses to the news on Twitter included praise for the initiative (“Excelente noticia para @elcolmex y el ámbito académico de México y el mundo”) and recommendations of specific titles (“Una de las joyas liberadas en acceso abierto [PDF / EPUB] por el Colmex a través de Jstor es /Los intelectuales y el poder en México/ (1991) un nutrido volumen colectivo que contiene muy buenas intervenciones, algunas de ellas referencias obligadas.”)

Conclusion

As a result of this project, 680 significant works of scholarship (with four more coming when EISBN assignments are available) that were previously out of print are now available to anyone who wishes to use them. They are easy to discover and access within researchers' existing digital workflows. The value of these titles is apparent in the strong usage we've seen over the relatively short period they've been available: more than half a million views and downloads across 173 countries. Scholars and students in Latin America and around the world are enriching their research with this content, and we have ensured that it will be available to future generations.

In addition, the Mexican government launched a project earlier this year, the Estrategia Nacional de Lectura, to promote reading and guarantee that books are accessible to the entire population. The 684 digitized titles will be openly available to the Mexican people and promoted as part of this project.

This project also built a foundation for continued work on the Open Access dissemination of Latin American scholarship. JSTOR is currently participating in a pilot led by the Latin American Research Resources Project (LARRP), a consortium of research libraries that is funding the Open Access distribution of 200 titles published in 2018-2019 by the Latin American Council of Social Sciences (CLACSO). This initiative, developed and supported by libraries, will test a framework for the sustainable, long-term stewardship of Open Access scholarly monographs.

We are grateful that the Humanities Open Book program grant funded by The Andrew W. Mellon Foundation provided the opportunity for JSTOR to partner with El Colegio de México to make its important scholarship available for researchers around the world to discover and use. We look forward to continuing to build on what we've achieved together.