Fix parsing Markdown in HTML #135

Open, wants to merge 3 commits into base: master


@zhouzi zhouzi commented Mar 20, 2019

A user reported an issue where the Markdown content of an HTML node (within a Markdown file) is not parsed. The goal of this PR is to fix that.

We already parse HTML in Markdown so it would make sense to parse Markdown in HTML. Here's an example of Markdown containing HTML containing Markdown, which is properly parsed by GitHub:

This div contains Markdown with a link and some bold content.

Source:

<div style="text-align: justify">

This div contains Markdown with a [link](https://www.google.com) and some **bold content**.

</div>

Note that the line breaks matter. Without blank lines around the content, the following:

<div style="text-align: justify">
This div contains Markdown with a [link](https://www.google.com) and some **bold content**.
</div>

Yields:

This div contains Markdown with a [link](https://www.google.com) and some **bold content**.
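The difference between the two snippets above is whether the content is separated from the opening and closing tags by blank lines. Here is a minimal JavaScript sketch of that check. It follows the behaviour described in this comment rather than the full CommonMark rules, it assumes each tag sits on its own line, and the helper name `isBlankLineDelimited` is hypothetical (it is not part of this project):

```javascript
// Hypothetical helper: true when the content of an HTML block is separated
// from both the opening and the closing tag line by a blank line.
// Simplification: assumes the opening and closing tags each sit on their own line.
function isBlankLineDelimited(block) {
  const lines = block.trim().split(/\r?\n/);
  // Minimum shape: opening tag, blank line, content, blank line, closing tag.
  if (lines.length < 5) return false;
  const isBlank = (line) => line.trim() === "";
  const inner = lines.slice(2, -2);
  return (
    isBlank(lines[1]) &&
    isBlank(lines[lines.length - 2]) &&
    inner.some((line) => !isBlank(line))
  );
}

const content =
  "This div contains Markdown with a [link](https://www.google.com) and some **bold content**.";
const loose = ['<div style="text-align: justify">', "", content, "", "</div>"].join("\n");
const tight = ['<div style="text-align: justify">', content, "</div>"].join("\n");

console.log(isBlankLineDelimited(loose)); // true
console.log(isBlankLineDelimited(tight)); // false
```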

@zhouzi zhouzi added the wip label Mar 20, 2019
@zhouzi zhouzi self-assigned this Mar 20, 2019

zhouzi commented Mar 22, 2019

I tried something that didn't work, so I thought I'd share the blockers. I've been relying on the CommonMark spec, and more specifically on one of its examples for HTML blocks.

Ideally, we should parse Markdown in HTML blocks that start and end with a line break. The problem is that we clean the input string with htmlclean, which removes those line breaks. This cleaning step is required: without it, we would interpret insignificant code formatting, which leads to unwanted whitespace and nodes.
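To illustrate the blocker, here is a small sketch with a stand-in cleaner. `collapseWhitespace` is a hypothetical simplification written for this example: it is not htmlclean and does not reproduce its rules, it only mimics the general effect of dropping line breaks between tags and text.

```javascript
// Hypothetical stand-in for an HTML cleaner (NOT htmlclean itself).
// Naive on purpose: it would also touch a literal ">" or "<" inside text.
function collapseWhitespace(html) {
  return html
    .replace(/\s*\n\s*/g, " ") // turn each run of line breaks into one space
    .replace(/>\s+/g, ">") // drop whitespace right after a tag
    .replace(/\s+</g, "<") // drop whitespace right before a tag
    .trim();
}

const source = [
  '<div style="text-align: justify">',
  "",
  "This div contains Markdown with a [link](https://www.google.com) and some **bold content**.",
  "",
  "</div>",
].join("\n");

const cleaned = collapseWhitespace(source);

// The blank lines that signalled "this is Markdown" no longer exist.
console.log(source.includes("\n\n")); // true
console.log(cleaned.includes("\n")); // false
```

Once the input looks like `cleaned`, the two snippets from the first comment become indistinguishable, so the decision has to be made before this step or carried through it in some other form.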

I am now thinking about cleaning the HTML through the HTML parser itself. I'll give it a shot.


Soreine commented Mar 27, 2019

I'm not sure that we can properly clean the HTML chunk by chunk (through the parser): htmlclean needs context to know what can be removed.
However, could we do a first parsing pass that detects divs that start and end with a line break and marks them (for example with an HTML attribute), so that we can treat their innerHTML as Markdown in the parser?
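A rough sketch of what such a first pass could look like, run on the raw input before any cleaning. Everything here is an assumption made for illustration: the attribute name `data-markdown`, the function name `markMarkdownDivs`, and the use of a regular expression, which only handles non-nested divs whose tags sit on their own lines.

```javascript
// Hypothetical attribute used to flag blocks for the later parsing step.
const MARKER = "data-markdown";

// Matches a <div ...> whose content is separated from both tags by a blank line.
// Group 1: the attributes of the opening tag. Group 2: the inner content.
const BLOCK =
  /<div\b([^>]*)>[ \t]*\r?\n[ \t]*\r?\n([\s\S]*?)\r?\n[ \t]*\r?\n[ \t]*<\/div>/gi;

function markMarkdownDivs(source) {
  return source.replace(BLOCK, (match, attrs, inner) => {
    // A regex cannot pair nested tags reliably, so leave such blocks untouched.
    if (/<div\b/i.test(inner)) return match;
    return `<div${attrs} ${MARKER}>\n\n${inner}\n\n</div>`;
  });
}

const text =
  "This div contains Markdown with a [link](https://www.google.com) and some **bold content**.";
const withBlankLines = `<div style="text-align: justify">\n\n${text}\n\n</div>`;
const withoutBlankLines = `<div style="text-align: justify">\n${text}\n</div>`;

console.log(markMarkdownDivs(withBlankLines).includes(MARKER)); // true
console.log(markMarkdownDivs(withoutBlankLines) === withoutBlankLines); // true
```

The marker is an attribute rather than whitespace, so it can still be read after the line breaks are gone. Whether the inner Markdown itself survives cleaning unchanged is a separate question that this sketch does not address.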
