While extending this site through BlogEngine.Net, I wanted to offer PDF versions of the posts. Creating a PrintView was fairly trivial (just remove CSS), but having a PDF generated has turned out to be a bit trickier than I thought.
The requirements for me were that the API needed to be open source and compatible with C#. I have used iText before for getting the page count of PDF files. I knew iText could generate PDFs easily, but what it can’t do is create PDFs from URLs.
There is no method like document.CreateFromURL(), so generating a PDF when a user clicks ‘PDF’ on a post required a bit of backend work.
The first thing is to understand how iText makes a PDF from HTML. iText doesn’t take a URL; the input must be an HTML file on the local file system. For integration with BlogEngine, I added two folders under the App_Data path, to which IIS has write access.
    string pdfDir = System.Web.HttpContext.Current.Server.MapPath("\\App_Data\\posts-pdf\\");
    string htmlDir = System.Web.HttpContext.Current.Server.MapPath("\\App_Data\\posts-pdf\\html\\");
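Writing a file into a folder that does not exist throws a DirectoryNotFoundException, so it can help to create both folders up front. The following is only a minimal sketch: the class and method names (PdfFolders, EnsureExists) are hypothetical and not part of BlogEngine or iTextSharp.

```csharp
using System.IO;

public static class PdfFolders
{
    // Hypothetical helper: creates both output folders if they are missing.
    // Directory.CreateDirectory does nothing when the folder already exists
    // and creates any missing parent folders along the way.
    public static void EnsureExists(string pdfDir, string htmlDir)
    {
        Directory.CreateDirectory(pdfDir);
        Directory.CreateDirectory(htmlDir);
    }
}
```

Calling PdfFolders.EnsureExists(pdfDir, htmlDir) once before the first conversion should be enough.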
The next step is to realize that iText HTML to PDF is a two-step process. First you need to convert the Post or whatever object contains the content to strict HTML.
    public static string ConvertPostToCleanHtml(Post p)
    {
        string postId = p.Id.ToString();

        // Build the HTML document as a string
        string html = "<html><head></head><body>" + Environment.NewLine;
        ...
        html = html + "</body></html>" + Environment.NewLine;

        // Write the string to an HTML file on the local server.
        // File.WriteAllText closes the file even if the write fails.
        string pathToHtmlFile = htmlDir + postId + ".html";
        File.WriteAllText(pathToHtmlFile, html, System.Text.Encoding.Default);

        return pathToHtmlFile;
    }
The problem I found is that the HTML needs to be strict: every <p> needs a closing </p>. If a closing tag is missing, HtmlParser doesn’t process that section. In testing, I found that the default editor in BlogEngine doesn’t add the closing </p>.
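One way to catch unclosed tags before handing the file to iText is to run the HTML string through .NET's XML parser, which rejects anything that isn't well-formed. This is only a sketch: the class and method names are hypothetical, and the check is stricter than iText may actually need. For example, an HTML entity such as &nbsp; fails here because plain XML doesn't define it.

```csharp
using System;
using System.Xml;

public static class HtmlStrictnessCheck
{
    // Hypothetical helper: returns true when the markup is well-formed XML,
    // meaning every opening tag has a matching closing tag.
    public static bool IsWellFormed(string html)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(html); // throws XmlException on malformed markup
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

A post that fails this check could be logged or cleaned up before conversion, rather than silently losing sections in the PDF.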
The next step after creating a clean local HTML file is to convert it to PDF. This is where you call the iTextSharp functions. As shown in the code snippet, once the HTML file exists, the process is straightforward.
    public static string ConvertCleanHtmlToPdf(string pathToHtmlFile, Post p)
    {
        string postId = p.Id.ToString();
        string pathToPdfFile = pdfDir + postId + ".pdf";

        // A4 page with left, right, top, and bottom margins given in points
        Document document = new Document(PageSize.A4, 80, 50, 30, 65);
        PdfWriter.GetInstance(document, new FileStream(pathToPdfFile, FileMode.Create));
        document.Open();

        // Document metadata
        document.AddTitle("PDF from vikrampant.com - " + p.Title);
        document.AddAuthor("Vikram Pant, http://www.vikrampant.com/");
        document.AddCreationDate();

        // Parse the local HTML file and add its content to the document
        HtmlParser.Parse(document, pathToHtmlFile);
        document.Close();

        return pathToPdfFile;
    }
Once this function runs, the PDF path is returned to the caller. Besides the strict HTML requirement, this is a fairly simple process. I plan on looking at other ways to generate PDFs from web pages. If you’re OK with paying a few hundred dollars, you can look at activePDF’s WebGrabber.
The final snippet of code I used for this iText + BlogEngine PDF-on-the-fly function compares the post’s modified date with the PDF file’s creation date. If my post already has a PDF rendition, why make a new one? But if a PDF doesn’t exist, or if the post has been modified since the PDF was made, then recreate it.
    DateTime pdfCreatedDate = File.GetCreationTime(pathToPdf);
    DateTime postCreatedDate = this.Post.DateCreated;
    DateTime postModifiedDate = this.Post.DateModified;
    // CompareTo returns a value less than zero when the PDF was created
    // before the post was last modified, which means the PDF is stale.
    if (pdfCreatedDate.CompareTo(postModifiedDate) < 0)
    {
        pdfFileValid = false;
    }

    ...

    string pathToServe = pathToPdf;
    if (!pdfFileValid)
    {
        // Regenerate the HTML and the PDF, then serve the fresh file
        string pathToHtmlFile = PrintPDFFunctions.ConvertPostToCleanHtml(this.Post);
        pathToServe = PrintPDFFunctions.ConvertCleanHtmlToPdf(pathToHtmlFile, this.Post);
    }

    // Read the whole PDF into memory and send it to the browser
    byte[] buffer = File.ReadAllBytes(pathToServe);
    Response.Clear();
    Response.Buffer = true;
    Response.ContentType = "application/pdf";
    Response.BinaryWrite(buffer);
    Response.Flush();
    Response.End();
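The freshness check can also be pulled into a small helper. The sketch below is one possible version, not the code running on this site. The class and method names are hypothetical. It compares against the file's last write time rather than its creation time, on the assumption that overwriting an existing PDF may leave the creation time unchanged. It also assumes the post's modified date is in the server's local time.

```csharp
using System;
using System.IO;

public static class PdfCache
{
    // Hypothetical helper: returns true when the PDF is missing
    // or was last written before the post was last modified.
    public static bool NeedsRegeneration(string pdfPath, DateTime postModifiedDate)
    {
        if (!File.Exists(pdfPath))
        {
            return true;
        }

        // GetLastWriteTime returns local time, so postModifiedDate
        // is assumed to be local time as well.
        return File.GetLastWriteTime(pdfPath) < postModifiedDate;
    }
}
```

With a helper like this, the page code would only need to call PdfCache.NeedsRegeneration(pathToPdf, this.Post.DateModified) before deciding whether to rebuild the PDF.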
The reason I am going this route is that many people host sites on third-party web hosts and don’t have access to the underlying OS. With this approach, you just drop in the iTextSharp DLL, a CS file, and an ASPX page. But because I host my website in my garage and have access to the OS, I may just try creating a solution that calls Acrobat on the server or something.