December 9th, 2008
Normalizer of Web Pages, Qualifier of URLs

Posted under Open Source & Web

Relative paths look like /images/filename.jpg; they locate a resource relative to the current page or site.  Fully qualified paths are complete addresses and look like http://domain.com/images/filename.jpg.

Sometimes you need to translate between the two.  Think of absolute URLs as the third normal form of the web.
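To make the distinction concrete, here is a minimal sketch that asks System.Uri which kind of link it is looking at. The class name UrlKindDemo and the helper Describe are made up for illustration and are not part of the download.

```csharp
using System;

// Hypothetical demo class; not part of the original download.
public static class UrlKindDemo
{
	// Classifies a link as it might appear in a src or href attribute.
	public static string Describe(string link)
	{
		// RelativeOrAbsolute lets Uri decide which kind it is. For web links,
		// IsAbsoluteUri is true when the string carries its own scheme (http, https, ...).
		Uri uri = new Uri(link, UriKind.RelativeOrAbsolute);
		return uri.IsAbsoluteUri ? "absolute" : "relative";
	}

	public static void Main()
	{
		Console.WriteLine(Describe("images/filename.jpg"));
		Console.WriteLine(Describe("http://domain.com/images/filename.jpg"));
	}
}
```

The first link has no scheme or host of its own, so it only makes sense next to the address of the page it came from; the second can be fetched from anywhere.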

The complete solution is 138 lines in two classes, and that includes fetching the content from the internet.  Filling in a host name and path for each relative link is easy.  You can download the full implementation.

The juicy bits are below:

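public string Code() {
	// Load the page into an Html Agility Pack document
	HtmlDocument doc = WebUtility.GetPage(originalHtml);
	// Walk the whole tree, qualifying every src and href along the way
	RecursiveQualifier(doc.DocumentNode);

	// Keep the rewritten markup and hand it back to the caller
	cleanHtml = doc.DocumentNode.OuterHtml;
	return cleanHtml;
}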

private void RecursiveQualifier(HtmlNode node) {
	QualifyNode(node);

	foreach (HtmlNode child in node.ChildNodes)
		RecursiveQualifier(child);
}

private void QualifyNode(HtmlNode node) {
	if (!node.HasAttributes)
		return;
	foreach (HtmlAttribute a in node.Attributes) {
		// Only src and href attributes point at other resources
		bool isLink = string.Equals(a.Name, "src", StringComparison.OrdinalIgnoreCase)
			|| string.Equals(a.Name, "href", StringComparison.OrdinalIgnoreCase);
		// Rewrite the value only when it parses as a URI and is not already absolute
		if (isLink && Uri.IsWellFormedUriString(a.Value, UriKind.RelativeOrAbsolute)
				&& !new Uri(a.Value, UriKind.RelativeOrAbsolute).IsAbsoluteUri)
			a.Value = QualifyUrl(a.Value).ToString();
	}
}

public static Uri Qualify(string baseUri, string relativePath) {
	// Already absolute: nothing to resolve
	if (Uri.IsWellFormedUriString(relativePath, UriKind.Absolute))
		return new Uri(relativePath);

	// Without an absolute base there is nothing to resolve against
	if (!Uri.IsWellFormedUriString(baseUri, UriKind.Absolute))
		return null;

	// baseUri was just validated as absolute, so parse it as such
	Uri b = new Uri(baseUri, UriKind.Absolute);

	return new Uri(b, relativePath);
}

Note: The Qualify methods don’t agree (QualifyNode calls QualifyUrl, while the static helper is named Qualify), but download the source code and you’ll see why.

Nearly every resource an HTML page can pull in is referenced through either a src or an href attribute:  images, Flash movies, style sheets, scripts, music, etc.  XHTML 2 allows any element to be a link, so searching for particular tag names doesn’t work.  Fortunately, Html Agility Pack comes to the rescue;  this gem is the XmlDocument of sloppy, malformed HTML.

Recursion lets you start at the root node and walk down the tree, checking every element for the attributes we’re interested in and “fixing” the links.  The Uri class takes care of what would otherwise be tedious string parsing of “/folder” and “../../parent/path” in URLs.
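For a feel of what the Uri class is doing on your behalf, here is a small sketch that resolves a handful of links against a page address. The class UriResolutionDemo, the helper Resolve and the page URL are all assumptions made up for this example; only the Uri constructors come from the framework.

```csharp
using System;

// Hypothetical demo class; the page address below is an assumed example.
public static class UriResolutionDemo
{
	// Resolves a link found on a page against that page's own address.
	public static string Resolve(string pageUrl, string link)
	{
		Uri page = new Uri(pageUrl, UriKind.Absolute);
		// The Uri(Uri, string) constructor handles leading slashes and ".." segments,
		// and returns the link unchanged when it is already absolute.
		return new Uri(page, link).ToString();
	}

	public static void Main()
	{
		string page = "http://domain.com/blog/2008/post.html";
		string[] links = { "/folder", "../../parent/path", "photo.jpg", "http://other.com/a.css" };

		foreach (string link in links)
			Console.WriteLine(link + " -> " + Resolve(page, link));
	}
}
```

With that page address, “../../parent/path” climbs out of /blog/2008/ and lands on http://domain.com/parent/path, while “photo.jpg” stays beside the page as http://domain.com/blog/2008/photo.jpg.  Doing the same with string splitting means handling every one of these cases by hand.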
