Paper to HTML Tutorial

Last updated: August 10, 2022

Paper to HTML is a web app that converts scientific papers into HTML. It was designed primarily to help improve the accessibility of scientific papers to blind and low vision users and users of assistive reading technology like screen readers or text-to-speech, though it may also benefit users of mobile or small screen devices. This tutorial describes the main features of this site.

  1. Upload a paper to Paper to HTML
    Select the "Choose File" button, and select a paper to process. This file is most often a PDF, but can also be the LaTeX source or an XML document representing the paper. Once you have selected a paper, click the "Upload" button to begin processing. Processing usually takes between 30 seconds and 2 minutes for each new paper uploaded into the system. During this time, the system runs a series of machine learning models to extract content from the document.
    A screenshot of the upload screen at papertohtml.org. The box says 'Select a paper and click Upload' followed by a file selection dialogue and upload button.
  2. When processing finishes, the resulting HTML is shown
    This page can be bookmarked so that you can return to the paper later. The bookmark name should default to the title of the uploaded paper, if the title was properly extracted.
    A screenshot of the first few elements in the HTML render of the paper 'Improving the Accessibility of Scientific Documents' showing the title, authors, beginning of the table of contents, and the first few sentences in the abstract.
  3. If major issues occur when processing the paper, these errors will be listed at the top of the page; these errors may indicate a low quality extraction
    A screenshot of error messages shown when processing quality is low. The text is red and says 'Warning: no authors were extracted. Warning: few to no references were detected. High number of warnings indicates low parse quality, please proceed with caution and refer to the original document...'
  4. Navigate between sections of the paper using the extracted section headers
    These are surrounded by the <h2> HTML tag, for example: the Data & Methods section
  5. Navigate to extracted tables and figures
    These are surrounded by the <figure> HTML tag, for example: Figure 1
  6. Section headings and extracted tables and figures are listed under the Table of Contents, which is located near the top of the page following the title and authors
    A screenshot of the first few elements in the Table of Contents, with sections such as Abstract, Introduction, and Related Work, and the Figures and Tables under each section.
  7. Individual sections within the paper can be shared
    For example, this link goes to the Data & Methods section of the research paper about this app.
  8. Some tables are converted into HTML for improved ease of reading
    For example: Table 2 from an example paper
    A screenshot of Table 2 from the example link. The table caption is followed by the HTML extraction of the table content, consisting of 4 columns and 7 rows. The first row consists of table headers.
  9. Bibliography entries are presented in the last section of the document, under the heading "References," as below:
  10. Inline citations in the document are linked to their corresponding entries in the bibliography; return links in the bibliography after each reference entry can take the reader back to their previous reading location
    A snippet from the main text of a paper reads 'Scientific literature is most commonly available in the form of PDFs, which pose challenges for accessibility [6, 34].' When the '34' link is clicked, it takes the reader to the corresponding entry in the bibliography section, a paper by Nielsen and Kaley. The return to section links include a link to the Introduction section, which takes the reader back to the initial location of the link that reads '34'.
  11. Low quality extractions are labeled
    You may encounter the following text: "Not extracted; please refer to original document." This text marks content that was extracted with low quality; in these cases, you may be better off going to the source document.
    A snippet of text that says 'EQUATION (2): Not extracted; please refer to original document.'
    A screenshot of a placeholder figure image with the caption text 'Figure 1: Not extracted; please refer to original document.'
  12. System shortcuts like Ctrl-F (find) and Ctrl-C (copy) work well with text in the HTML render
    A screenshot of a few sentences and a figure and figure caption from an example paper, where Ctrl-F is used to highlight two occurrences of the phrase 'compliance rates.'
  13. Paper to HTML can be used in conjunction with various web translation tools
    For example, Google Translate (available for Chrome and Firefox) can quickly translate the whole document into any of 133 languages. These are external tools, so we have not validated them and cannot guarantee the quality of their translations.
    A screenshot of an example paper being translated into Spanish using the Google Translate Chrome extension.
  14. You can also save the HTML document for offline reading
    Select 'File, Save as...' in your browser. However, please note that the within-document navigation links will not function in the saved page.
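
Items 4, 5, and 11 above note that section headers use the <h2> tag, that tables and figures use the <figure> tag, and that low quality extractions carry a fixed placeholder text. As a minimal sketch, assuming you have saved a copy of the HTML render, the following Python script lists the section headers, counts the <figure> elements, and counts the placeholders. The class name, the function name, and the sample markup are hypothetical and for illustration only; the real render may contain additional markup not described in this tutorial.

```python
from html.parser import HTMLParser

# Placeholder text quoted in item 11 of this tutorial.
PLACEHOLDER = "Not extracted; please refer to original document."


class RenderOutline(HTMLParser):
    """Collect <h2> header text, count <figure> elements, and keep all text."""

    def __init__(self):
        super().__init__()
        self.sections = []      # text of each <h2> header, in document order
        self.figure_count = 0   # number of <figure> opening tags seen
        self._in_h2 = False
        self._h2_parts = []
        self._text_parts = []   # all text, used to count placeholders

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._h2_parts = []
        elif tag == "figure":
            self.figure_count += 1

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            # Collapse runs of whitespace inside the header text.
            self.sections.append(" ".join("".join(self._h2_parts).split()))
            self._in_h2 = False

    def handle_data(self, data):
        self._text_parts.append(data)
        if self._in_h2:
            self._h2_parts.append(data)

    @property
    def placeholder_count(self):
        text = " ".join("".join(self._text_parts).split())
        return text.count(PLACEHOLDER)


def outline(html_text):
    """Return (section headers, figure count, placeholder count)."""
    parser = RenderOutline()
    parser.feed(html_text)
    parser.close()
    return parser.sections, parser.figure_count, parser.placeholder_count


# Hypothetical sample markup; a real saved render will be much larger.
sample = """
<h1>Example Paper</h1>
<h2>Introduction</h2><p>Some text.</p>
<h2>Data &amp; Methods</h2>
<figure><figcaption>Figure 1: Not extracted; please refer to original document.</figcaption></figure>
"""

sections, figures, placeholders = outline(sample)
print("Sections:", sections)
print("Figures and tables:", figures)
print("Placeholders:", placeholders)
```

To run this on a saved page, read the file into a string (for example with open("paper.html", encoding="utf-8").read(), where paper.html is a hypothetical file name) and pass it to outline(). A high placeholder count suggests that the original document is the better source for that paper.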

Still have unanswered questions? Check out our about page or email accessibility@semanticscholar.org.