I was posed a little request which begged a prototype. And I do like R&D. What this time?
Take an entire blog and create a well-formatted PDF (based on some loose presentation rules).
I’m playing with ideas as I write. OK, yeah yeah, “I’m now resting” as I collect my thoughts.
Actually “Blog” is a rather generic term. There are many blog engines, for example Blogger, and one that I happen to use right now – WordPress. And I fear that each may have its own way of extracting data. Hmmm, warning bells – need to use a high level tool, if at all possible.
At first I thought, sure, iterate through all the posts and just grab a PDF of each, site formatting and all. Nice generic ‘low coding’ solution. Using the likes of web2PDF. Well, the resulting output could be potentially humongous, and it was. PDFs are for actual paper printing after all.
To get down and dirty as quickly as possible, just to see what could be done, I chose to target a Blogger blog. Why? Simply because I had access to one that wasn’t mine (my wife’s, and much better to break her site than mine ;-), and because Blogger are owned by Google, and I happen to be interested in the Google APIs. Aside: I’d even set up “Google Apps” for this company site, Lexecorp, for staff mail, calendar, docs,…
Anyway, so how does one get info out of a blog? I had a vague idea of manipulating the remote blog site with something like webTest’s clickLink, clickButton, clickElement, yada, yada and XQuery to specify the resource. Nah, that’s just what was near the surface of my brain pit. Just too time consuming.
As I had Blogger in the crosshairs, I did a broad search and found the Blogger GData API. It comes under the banner of Google Data (GData) APIs. There was the option to go for Python, but I took the Java route, and read this intro.
I then returned to the Blogger specific APIs and followed this guide. Note that they live in the Labs so expect them to be potentially clever but brittle.
I installed the dependencies and was going to write some code. But urghh, the ant xml files were just too complicated. Using just Emacs didn’t cut the mustard, so I set up a new vanilla Java Project in Eclipse, with the dependencies I had been told to satisfy.
Then I tried to write some code and got this
“Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps”
which I resolved from here. Just the extra core dependencies that the doc doesn’t mention…
And so I got this code, initially based on the sample code that one can also download.
// Using the GData API
// to get a complete blog post list
import com.google.gdata.client.GoogleService;
import com.google.gdata.client.blogger.BloggerService;
import com.google.gdata.data.Entry;
import com.google.gdata.data.Feed;
import com.google.gdata.util.ServiceException;
import java.io.IOException;
import java.net.URL;

public class BlogList {

    // The metafeed lists the blogs owned by the authenticated user
    private static final String METAFEED_URL =
        "http://www.blogger.com/feeds/default/blogs";

    private BlogList() {
        // do nothing
    }

    private static String getBlogId(BloggerService myService)
            throws ServiceException, IOException {
        // Get the metafeed
        final URL feedUrl = new URL(METAFEED_URL);
        Feed resultFeed = myService.getFeed(feedUrl, Feed.class);
        // If the user has a blog then return the id (which comes after 'blog-')
        if (resultFeed.getEntries().size() > 0) {
            Entry entry = resultFeed.getEntries().get(0);
            return entry.getId().split("blog-")[1];
        }
        throw new IOException("User has no blogs!");
    }

    public static void printAllPosts(GoogleService myService, String blogId)
            throws ServiceException, IOException {
        // Request the feed
        URL feedUrl = new URL("http://www.blogger.com/feeds/" + blogId + "/posts/default?max-results=500");
        Feed resultFeed = myService.getFeed(feedUrl, Feed.class);
        // Print the results
        System.out.println(resultFeed.getTitle().getPlainText());
        int postCnt = resultFeed.getEntries().size();
        System.out.println("Retrieved " + postCnt + " blog posts");
        for (int i = 0; i < postCnt; i++) {
            Entry entry = resultFeed.getEntries().get(i);
            System.out.println("\t" + entry.getTitle().getPlainText());
        }
        System.out.println();
    }

    public static void run(BloggerService myService, String userName, String userPassword)
            throws ServiceException, IOException {
        // Authenticate using ClientLogin
        myService.setUserCredentials(userName, userPassword);
        // Get the blog ID from the metafeed.
        String blogId = getBlogId(myService);
        System.out.println("blogId = " + blogId);
        // Feed query.
        printAllPosts(myService, blogId);
    }

    /**
     * @param args unused for now
     */
    public static void main(String[] args) {
        // TODO: add args parser l8ta
        String userName = "theUserName";
        String userPassword = "thePassword";
        BloggerService myService = new BloggerService("lexecorp-bookApp-1");
        try {
            run(myService, userName, userPassword);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The code taught me that Blogger has blog IDs. And in essence that I should target the blog’s feed (RSS-style syndication; the GData feeds themselves are Atom) – the prime solution at this moment. At least it will remain so if I can get pictures too (I want all the blogger’s images to go neatly into the final PDF).
Output?
blogId = xxxxxxxx
A Day in The Lives of xxxxx & xxxxx
Retrieved 279 blog posts
Farmstay
Friends
....
Baby Shower ?Pre-bubba)
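The blogId on the first line of that output is simply whatever follows 'blog-' in the metafeed entry's id. A slightly more defensive take on the split("blog-")[1] trick could look like the sketch below. The class and method names are made up for illustration, and the sample id in main is a hypothetical value showing the shape I saw, not a real blog.

```java
final class BlogIds {

    private static final String MARKER = "blog-";

    // Returns the text after the last "blog-" marker in a metafeed entry id.
    // Fails loudly instead of throwing ArrayIndexOutOfBoundsException when the marker is missing.
    static String blogIdFrom(String entryId) {
        int marker = entryId.lastIndexOf(MARKER);
        if (marker < 0) {
            throw new IllegalArgumentException("no 'blog-' marker in: " + entryId);
        }
        return entryId.substring(marker + MARKER.length());
    }

    public static void main(String[] args) {
        // Hypothetical id, only to show the assumed shape
        System.out.println(blogIdFrom("tag:blogger.com,1999:user-1234.blog-5678"));
    }
}
```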
And it all boils down to
http://www.blogger.com/feeds/’theTargetBlogID’/posts/default?max-results=500
Wasn’t sure about how to say gimme all posts (I think “-1” signifies that with WordPress), but the max-results was big enough for a test, and more than what my wife had posted thus far.
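For a blog with more posts than one request will return, the usual GData answer is paging: start-index (1-based) alongside max-results. Here is a minimal sketch that only builds the page URLs, with no network involved. FeedPager and pageUrls are names I made up, and the total post count is assumed to be known already (for instance from the feed's totalResults element).

```java
import java.util.ArrayList;
import java.util.List;

final class FeedPager {

    // Builds one feed URL per page, using the GData start-index / max-results query parameters.
    static List<String> pageUrls(String blogId, int totalResults, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<String> urls = new ArrayList<>();
        // start-index is 1-based, so the first page starts at 1
        for (int start = 1; start <= totalResults; start += pageSize) {
            urls.add("http://www.blogger.com/feeds/" + blogId
                    + "/posts/default?start-index=" + start
                    + "&max-results=" + pageSize);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder blog id and an arbitrary page size of 100
        for (String url : pageUrls("theTargetBlogID", 279, 100)) {
            System.out.println(url);
        }
    }
}
```

Each URL could then be fed to getFeed (or to XmlSlurper) in turn and the entries concatenated.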
But this is just an XML stream! Hmmm, now knowing Groovy, I thought XML Slurper.
Actually I still have issues with Groovy performance. This discussion here hits the nail on the head. But in this instance we will be net I/O bound, so who cares.
But looky here. Yes, note the programmer productivity! This is a much better way to play with the XML feed.
def url = "http://www.blogger.com/feeds/xxxxxxxx/posts/default?max-results=500"
def feed = new XmlSlurper().parse(url)
println "Author = $feed.author.name"
println "Title = " + feed.title
println "Sub Title = " + feed.subtitle
def posts = feed.entry
println "posts count = " + posts.size()
posts.title.each { it ->
println "\t $it"
}
// HOW to parse <openSearch:totalResults>273</openSearch:totalResults> - Use "quotes" in some fiendish way ?
//println "totalResults = " + feed.openSearch // at least posts.size() works
Well, it doesn’t log in, yes, but does one need to? If I need to automagically suss out what blogs an owner has, then maybe the two approaches (GData and Groovy’s XML Slurper) are complementary. Java and Groovy, really it’s just too easy to blend the two.
I’ll have to see where this takes me as I explore further. But assuming I can deduce the blogId by some other means, then the results are so far the same, and for a lot less code in the Groovy case. The same output, and more!
Author = Xxxxx and Xxxxx
Title = A Day in The Lives of Xxxxx & Xxxxx
Sub Title = Xxxxx&Xxxxx live in ...
posts count = 273
Farmstay
Rainy days of Xxxxx & Xxxxx
...
Baby Shower ?Pre-bubba)
totalResults =
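That empty totalResults comes down to the openSearch: namespace prefix. On the Java side, a namespace-aware DOM parse can pick the element up by local name. The sketch below is only that, a sketch: the class name is made up, the inline sample feed is hand-written, and the openSearch namespace URI in it is an assumption (which is why the lookup uses "*" to match any namespace).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class TotalResults {

    // Reads the first <*:totalResults> element from a feed document, whatever its namespace.
    static int totalResults(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS to work
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(feedXml.getBytes(StandardCharsets.UTF_8)));
            NodeList hits = doc.getElementsByTagNameNS("*", "totalResults");
            if (hits.getLength() == 0) {
                throw new IllegalStateException("no totalResults element in feed");
            }
            return Integer.parseInt(hits.item(0).getTextContent().trim());
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample; the openSearch namespace URI here is an assumption
        String sample = "<feed xmlns='http://www.w3.org/2005/Atom'"
                + " xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/'>"
                + "<title>demo</title>"
                + "<openSearch:totalResults>273</openSearch:totalResults>"
                + "</feed>";
        System.out.println("totalResults = " + totalResults(sample));
    }
}
```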
I’ve obviously some XML parsing to suss out, maybe double byte language handling (from the output I noticed a lot of “???” for Asian fonts – my wife blogs in English and Japanese simultaneously), grabbing the images… But all in all this looks like a very promising line. At the end of the chain I’ll probably use iText to generate the PDF directly.
I’ll post more here, in my “grey matter backup area”, as I get further. More fun than you can shake a stick at!
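My guess on the “???” is that the feed data is fine and the damage happens on output: printing through a stream whose charset can't encode Japanese replaces each such character with '?'. A small sketch of that effect, and of printing through an explicit UTF-8 PrintStream instead, follows. The class and helper names are made up, and whether this is really the cause in my setup is an assumption still to be checked.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class ConsoleEncoding {

    // Prints text through a PrintStream using the named charset,
    // then decodes the captured bytes with that same charset.
    static String roundTrip(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toString(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String title = "\u65e5\u672c\u8a9e"; // three Japanese characters, escaped to keep the source ASCII

        // A charset that cannot encode the characters substitutes '?' for each one
        System.out.println(roundTrip(title, "US-ASCII"));

        // Printing through an explicit UTF-8 stream keeps them intact,
        // provided the terminal itself displays UTF-8
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(title);
    }
}
```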


