
What?

The PickledHedgehog Page Downloader is exactly what it says on the tin: it will (when it's finished) download a webpage (that you specify) to the directory specified in the web config. I'm also aiming to make this open source, so that you can improve it!

Why?

I used to bookmark every useful page, but that gets very tiresome, and sometimes the page goes offline. This way it will just save the page, the images, and the CSS (hopefully) to your desktop, or wherever else you choose.

How?

The "spider", for lack of a better word, uses HttpWebRequest:

        // Requires: using System.IO; using System.Net;
        private static string GetPageResponse(string URL)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
            request.KeepAlive = false;
            request.Referer = URL;
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)";

            // Dispose the response and the reader so the connection is released.
            using (HttpWebResponse response2 = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response2.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }


Yes, I spoof the UserAgent. That's mainly because some web hosts don't like a .NET process accessing their website. If you can think of a better way to do this, give me a shout!

Once I have the page source, I use a Regex to extract the image URLs:

            // Match src attributes that end in a common image extension, ignoring case.
            Regex r = new Regex(@"src[^>]*[^/]\.(?:jpg|jpeg|gif|png)", RegexOptions.Multiline | RegexOptions.IgnoreCase);
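
Here is a rough sketch of how the extraction step could look once the page source is in hand. The `ImageExtractor` class, the `ExtractImageUrls` method, and the tightened pattern with a capture group are assumptions made for illustration, not code from the prototype. The capture group is there so that only the URL comes back, without the leading `src=` and quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImageExtractor
{
    // Hypothetical variant of the pattern: group 1 holds just the quoted URL
    // of any src attribute ending in a common image extension.
    private static readonly Regex ImageSrc = new Regex(
        "src\\s*=\\s*[\"']([^\"'>]+\\.(?:jpg|jpeg|gif|png))[\"']",
        RegexOptions.IgnoreCase);

    // Returns every image URL found in the HTML, in document order.
    public static List<string> ExtractImageUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in ImageSrc.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Passing the string returned by GetPageResponse to `ExtractImageUrls` would give a list of src values, still relative wherever the page wrote them that way.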


And hey presto! I just have to do a little prefixing here and there to get the fully qualified URL, then I go ahead and download them too!
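
The prefixing and download could be sketched roughly like this. It assumes the `Uri(Uri, string)` constructor does the prefixing instead of hand-rolled string work, and that `WebClient.DownloadFile` does the fetching. The `ImageDownloader` class and its method names are made up for the example, and the prototype may well do this differently.

```csharp
using System;
using System.IO;
using System.Net;

static class ImageDownloader
{
    // Turns a src value (relative or already absolute) into a fully
    // qualified URL, resolved against the page it was found on.
    public static Uri ToAbsolute(string pageUrl, string src)
    {
        return new Uri(new Uri(pageUrl), src);
    }

    // Downloads one image into targetDirectory, named after the last
    // path segment of the URL, and returns the path it was saved to.
    public static string Download(Uri imageUrl, string targetDirectory)
    {
        string fileName = Path.GetFileName(imageUrl.LocalPath);
        string destination = Path.Combine(targetDirectory, fileName);
        using (var client = new WebClient())
        {
            client.DownloadFile(imageUrl, destination);
        }
        return destination;
    }
}
```

For a page at `http://example.com/blog/post.html`, a src of `images/a.png` resolves under `/blog/`, while `/b.jpg` resolves against the site root. A src that is already a full URL passes through unchanged.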

Alpha? Beta?

Since, at the time of writing, I have JUST started working on it (I have a very early prototype ready), I think I'm going to wait and develop it further before I unleash it onto the world. If you think you want to try it, drop me a mail.


* Edit - In hindsight, "response2" isn't such a good name for a variable. I really should change it to something more suitable, like response or webResponse. Same for request, which should be webRequest. *