JavaScript and dynamic content loading with Ajax have changed the way the web is used. User interfaces have advanced rapidly and “web content” is becoming more and more complex. This, however, presents whole new kinds of problems for parties other than the expected end user. Web content is intended to be read with a graphical browser, such as Firefox or Google Chrome. JavaScript capabilities are taken for granted, as they should be, since we live in modern times.
However, sometimes there is still a need to read the content programmatically. Ajax content loading can become quite a pain in the ass in such situations, as I have now learnt.
In most cases, content retrieved by Ajax after the normal page has loaded isn’t a problem. If you really need the content, you can usually see how the Ajax call is made and use the same URL to get the data directly. In most cases. Sometimes the Ajax calls are not so obvious: the JavaScript might be obfuscated, or you can’t really tell where the data is being loaded from. Or sometimes you just need to simulate the actual page loading. Test automation could be one such situation. Nonetheless, the problem is loading the actual data. Before the Ajax era, you could just use wget, or a library provided by your programming language of choice (urllib2 in Python, for example), to fetch the HTML source. JavaScript changed this, as almost nothing except actual web browsers can execute it while reading the source.
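For the easy case, where the Ajax URL can be spotted in the browser's network tools, fetching the data directly might look roughly like this. It is a minimal sketch, assuming Python 3, where urllib.request takes the place of urllib2. The function name fetch_text, the example URL in the comment and the X-Requested-With header are assumptions for illustration only.

```python
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10):
    """Fetch a URL and return the response body as text."""
    # Assumption: some endpoints only answer requests that look like Ajax
    # calls. Many JavaScript libraries send this header; it is optional here.
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Hypothetical usage; the URL is made up:
# data = fetch_text("http://example.com/ajax/items?page=1")
```

This only helps when the endpoint is known. When the URL is hidden behind obfuscated JavaScript, something still has to execute the script.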
A few days ago I faced such a problem. I spent most of my day figuring out how to read the content of a JavaScript-rich web 2.0 page using only tools available in a shell environment. I preferred something I could use with Python, but nothing seemed to work. When I had almost lost all hope of achieving my goal, I decided to take one last look at PyQt4. I knew Qt had a WebKit component, but I didn’t know whether it could be used in a non-graphical environment or whether PyQt4 supported it. It can but it cannot, and it does but it does not…
When I first tried PyQt4, it seemed to include the WebKit module. But it didn’t. After a long session of banging my head against the wall, I realized that even though the QtWebKit module was there, it didn’t actually do anything. It just failed silently when I used it. Then I came across this: http://www.insecure.ws/2008/09/16/xserver-less-webpage-screenshot#comment-239. It seems version 4.4.2 of PyQt4 is sometimes built without WebKit support. Guess which version I was using? I updated my PyQt and immediately got some results. I also found out that QWebPage can be used without any widget. I also learnt that PyQt crashes when you create such an object without first starting a GUI application with QtGui.QApplication(sys.argv). It doesn’t work if you pass False as the second argument to start a non-GUI Qt application. This seemed like a deal breaker to me, even though I was able to render the complex HTML with its Ajax calls correctly, with all the data included. It bugged the hell out of me. I drew no windows, no dialogs, nothing graphical. And yes, you cannot run a PyQt GUI application in a plain shell without an X display, even though it produces no graphics. There was no visible QWebPage, everything happened in the background, so why the hell didn’t it work as a non-GUI Qt application? I never found an answer to this, but I found a way around it: Xvfb, thanks to the post I referenced earlier.
Xvfb is a virtual framebuffer for X. What a life saver. I don’t know how it does what it does, but basically it fakes a display in a Linux environment where X is not available. I guess it directs everything to /dev/null or something, which in my case was just perfect, as I didn’t even want to produce anything graphical. My only interest was the HTML content, in textual format. So now, with the help of Xvfb, I can run “graphical programs” in a shell, render my complex web pages programmatically, and parse the results as I please. Maybe my solution is not perfect and maybe there is some cleaner way of doing it, but I couldn’t find it. If you know one, please share! Until then, I’m quite happy running my PyQt application (even though it seems quite heavy for such a basic task).
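Starting Xvfb and pointing a program at it can be sketched from Python as follows. This is only a rough sketch under assumptions: the display number :99, the screen size, the one-second startup wait and the script name fetch_page.py are all made up for illustration, and the Xvfb binary must be installed and on the PATH. Many distributions also ship an xvfb-run wrapper script that does much the same thing.

```python
import os
import subprocess
import time


def xvfb_command(display=":99", screen="1024x768x24"):
    """Build the argument list that starts Xvfb on the given display."""
    return ["Xvfb", display, "-screen", "0", screen]


def env_for_display(display=":99", base_env=None):
    """Return a copy of the environment with DISPLAY pointing at Xvfb."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def run_under_xvfb(program_args, display=":99"):
    """Start Xvfb, run a program against it, then shut Xvfb down again."""
    xvfb = subprocess.Popen(xvfb_command(display))
    try:
        time.sleep(1)  # crude assumption: one second is enough for startup
        return subprocess.call(program_args, env=env_for_display(display))
    finally:
        xvfb.terminate()
        xvfb.wait()


# Hypothetical usage; fetch_page.py stands in for the PyQt script:
# exit_code = run_under_xvfb(["python", "fetch_page.py"])
```

The program being run needs no changes. It simply finds a working display through the DISPLAY environment variable.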
After I gave PyQt a shot, I did some googling and found out that others have used the same approach. Here’s one link:
http://aezell.wordpress.com/2009/02/13/screenshot-a-url-with-python-and-qt-and-webkit/
Funny how this idea has been taken from the insecure.ws website but is so rarely referenced.
Following the pingbacks, they all point to a different page, each saying “I found it first here”, while most of the old pages reference the insecure.ws site.
By the way, you should try this code when you have problems with Python:
http://www.insecure.ws/2009/04/02/xserver-less-webpage-screenshot-c
Yeah, it’s sometimes difficult to trace things back to the original source, and I’ve decided to just link to the pages I’ve stumbled across myself so people can find out for themselves if they’re interested. Maybe I’ll add one more link, as finding out about Xvfb was the final piece of the puzzle for me, and I think that’s when I first came across those PyQt/screenshot posts. I would probably have found them sooner if my goal had been to get a screenshot instead of just the dynamically generated source.
Regarding your link for using pure Qt/C++ to render the page: dang, why didn’t I think of that? I can definitely use that approach myself, although in my case I will probably still hand the content parsing to my Python script. I guess I was blinded by my desire to use Python over anything else. Thanks for the link and the idea!
I have now rewritten my page fetching tool in pure Qt/C++ instead of PyQt, but it performs basically the same. The PyQt version used a little more memory, but that’s about it.
One more thing: I actually forgot one of the most important things from my original post regarding fetching Ajax-loaded content. You need some sort of mechanism for waiting for the dynamic content to load. I set up a thread for this, which waits a bit and then re-reads the HTML source provided by QWebFrame. I use “trigger phrases” to check whether the dynamic content has really loaded and then proceed. In my case, the original source has the text “Loading…”, which is replaced by the actual content, so it’s pretty easy to check. You gotta love BeautifulSoup. I’d go insane if I had to do HTML parsing in C++ (but I’m sure there are suitable libraries for that too).
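The waiting logic can be sketched independently of Qt. This is a minimal sketch and not my actual code: read_html is a hypothetical callable that returns the current page source (in the Qt case it would wrap whatever returns the frame's HTML), and the timeout and polling interval are arbitrary defaults.

```python
import time


def wait_for_content(read_html, placeholder="Loading…", timeout=30.0, interval=0.5):
    """Poll read_html() until the placeholder text is gone, then return the HTML.

    Raises TimeoutError if the placeholder is still present after `timeout`
    seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = read_html()
        if placeholder not in html:
            return html  # the trigger phrase has been replaced by real content
        if time.monotonic() >= deadline:
            raise TimeoutError("dynamic content did not load in %.1f s" % timeout)
        time.sleep(interval)  # wait a bit before re-reading the source
```

Checking for the disappearance of a known placeholder is the simplest trigger. A page without such a marker would need a positive check instead, for example waiting until some expected element or phrase appears. Note also that the sleep blocks the calling thread, which is one reason to run such a loop outside the thread that drives the page.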