Monday, November 22, 2004

Linux and PDAs / Phones

Well, my pain about what to buy with regard to PDAs or phones has just got worse with the introduction of these!

New machine

Well, it took a lot more effort than I thought, but I finally have everything up and running on the new machine. I have even changed the architecture of my home network as well. The reason for the delay was in how I was connecting to the net. In the past I was using a USB modem which, while not a quality modem, did the job. After mirroring the old terra, I thought it should be as simple as bringing up the new box and away we go; of course nothing ever is that simple :) The key problem was that, regardless of what I did, I couldn't get the new machine to talk to the USB modem. At first I assumed it was just my kernel config, so I spent some time rebuilding kernel after kernel, changing configs and then kernel versions around. Further research showed that the problem was that the driver powering the USB modem under Linux, eciadsl, did not support non-UHCI USB interfaces, and my new machine of course didn't have any of those. After trying more kernels (I was told this problem "possibly" went away under the latest 2.6.10pre kernels; it didn't) I got fed up with the whole setup and decided that I would get some kind of dedicated device to do the internet connection.

Now normally I shy away from this because I want to terminate my net connection ON a Linux machine, as I have more confidence in a Linux machine's security and stability than I do in some black box. Additionally, terminating it on something else means that I lose an IP, something that is problematic for me as I am short of real IPs in the first place. After some more research I realised that I could set up a bridging mode, where basically I have an ADSL-to-ethernet converter (a pure bridge) that connects to my Linux box, which acts as the gateway. The win with this type of setup is that I maintain my single access point (and consequently a single point of security enforcement and monitoring) but don't lose any IPs. It sounded great, so after borrowing a Cisco router from work I asked one of the hardcore network guys to help me out with a configuration..... 6 hours later I got a "it's theoretically possible but I don't think I can get it working" reply. Hmmm.

In the end it is up and running without the bridging mode, so I have lost one IP, and that will do until I move to my new place and get a chance to set it all up correctly. Of course, what it did mean was that I had to change the IP of my main server (due to conflicts with my router), and so everyone was unable to reach the network for a longer period of time than I would have liked. Still, it's all up and happy now ... at least until I move house sometime in the next month.
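For the record, here is roughly what the bridged setup would look like from the Linux side, plus the quick check that would have saved me all those kernel rebuilds. This is only a sketch under assumptions: the interface names and ISP login are placeholders, it assumes a PPPoE line terminated with the rp-pppoe pppd plugin (a PPPoA line needs a different approach entirely), and the firewall rules are just the bare minimum to show the idea.

    # see what flavour of USB host controllers a board actually has --
    # eciadsl only talks to UHCI ones, which the new machine doesn't have
    lspci | grep -i usb

    # the bridge idea: the ADSL box is a dumb bridge on eth1 and the Linux box
    # terminates the PPP session itself (credentials live in /etc/ppp/pap-secrets
    # or chap-secrets)
    pppd plugin rp-pppoe.so eth1 user "isp-login" defaultroute persist

    # route the real IP block straight through -- no NAT, which is the whole point
    echo 1 > /proc/sys/net/ipv4/ip_forward

    # ... and keep this box as the single point of enforcement and monitoring
    iptables -A FORWARD -i eth0 -o ppp0 -j ACCEPT
    iptables -A FORWARD -i ppp0 -o eth0 -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -P FORWARD DROP

The appeal over a NATing router is exactly that last part: all the filtering and logging stays in one place, and no real IPs get eaten by the modem.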

Wednesday, November 10, 2004

Uptime

It seems appropriate, after my recent blog about downtime, to have one about uptime, especially as the lack of it recently has seen my frustration levels rise. While there is an ever-increasing number of "admins" in IT, it seems that there is an ever-decreasing number of enterprise admins. By enterprise I mean admins that know how to look after large sites and ensure uptime. That doesn't mean just making sure that the boxes are patched and functioning, but also that the design of the network and services can scale, respond to short, sharp spikes in load and generally be reliable at all times. I mean admins that know what it's like to lose thousands of pounds per second when your service is down, and know how to avoid it. I say this because of two experiences in the last two days.

The first experience was with World of Warcraft. This is a new game that is set to be the next BIG thing in MMORPGs. Now for some time Blizzard, the company that makes WoW, have been saying that they will have an "open" beta, meaning that the general public will get a chance to test the game before Blizzard marks it as ready for general consumption and final release. This is a great opportunity for people who might be interested in the game to try it out, people like me :) Now the problem with this is that the client to connect to WoW is 2.5GB, and when you have thousands of testers wanting to play, that is a LOT of bandwidth needed to give each user a copy. Blizzard, scared of trying to host something like that themselves, took a different route. What they did was tell fileplanet that if they hosted the signup for the open beta, they would give fileplanet exclusive rights over it in the form of a whole heap of keys (that is, the ability to actually get INTO the open beta, because while it is "open", in reality there were a limited number of places to be filled). Fileplanet accepted and promptly started offering deals to "subscribe to fileplanet" and at the same time get "free entrance into the WoW beta". A wonderful marketing opportunity for them, and Blizzard doesn't have to host it. A win all round, yes? Well, actually, NO. The problem was quite simple: despite the fact that fileplanet had obviously sold Blizzard on the idea based on their ability to host something like this, they quite clearly didn't have the skills to do it. Let me explain.

Now to jump forward a bit, it turns out that the idea of anyone trying to serve out 2.5GB of data to a huge number of yearning gamers frightened everyone involved, so they came up with an interesting way around the problem: the client you used to download the game client was a custom-built BitTorrent client. For those of you that don't know what BitTorrent is, I encourage you to go to its site and check it out, but for the purposes of this rant it's enough to say that instead of having to download the 2.5GB from one server run by a company, you end up grabbing pieces of the file from other people who are downloading it as well, thus alleviating the load on the company's servers. Now bearing this in mind, we see fileplanet only had to do a few things:
1. Provide an interface to sign up to the open beta. This was in the form of ONE webpage form which took some details you put in and dumped them into a database.
2. Send each new signup a client with which to download the game (the aforementioned BitTorrent client).
Now I happened to be up when the open beta finally went live, and so I immediately headed over to get my key and sign up. Imagine my surprise when the company that had sold this entire idea to Blizzard on the power and size of their site had fallen into a complete heap. For a start the web page just stopped, I mean completely halted. Then the errors started. It turned out the entire fileplanet site is run on some stupid conglomeration of MS .NET technology that quite clearly was simply not up to the task of servicing that many requests. Over the next 30 minutes I saw errors from the .NET technology, errors from the webservers, errors from the database, errors from protocols (i.e. timeouts), and just about everything else that could break did. What's more, you could almost SEE their technicians running around randomly restarting things, and by the time they had moved on to the next broken component the first one had died under the load because it wasn't able to talk to anything else (any decent admin would take the entire site offline, do what is needed, then bring it all back at once). It would have been comical if it wasn't for the fact that I was trying to get some information out of this mess. The site was still useless a solid 8 hours later when I got up (on a different note, I was able to get my key, but only because I sat there spamming a query into their DB and caught the few seconds when it finally came up :). The whole incident really killed any faith I had in even large sites doing things "right".

The next incident that fired me up was the release of Firefox version 1 yesterday. Now this is the long-awaited release of what I personally think is the best browser around right now, and so I was keen to get my hands on the new version. Much to my chagrin, when I went to www.mozilla.org it had slowed to a crawl; it took over 3 minutes to render one page. I was quite shocked, because I always tend to think better of my free software friends than of those that run MS apps, but it seemed that the problem was not specific to an architecture but rather simply a lack of good admins (I group architects in with admins, as a good admin will do both). Now to be fair to mozilla.org, it is possible that the load they experienced was simply NOT feasible to plan for, and at least their site was still working, albeit incredibly slowly. It is also possible that they knew their site would not deal with the load and that their design was near perfect BUT that they didn't have the money to set up the right infrastructure to deal with it, but I don't think so. Fileplanet certainly doesn't have that excuse, as they are a commercial organisation dedicated to doing events like the WoW open beta.

Where are all the admins?

Sunday, November 7, 2004

New box

Well, after looking at some more kernel panics with a friend, we came to the conclusion that the problem was not the USB kernel code as I had thought, but rather it looked like the HD. With that in mind I then disabled my software RAID 1 setup, or at least I thought I did. It turned out that although I had disabled it at the software level, when I rebooted the kernel had "autodetected" the RAID 1 configuration from the superblocks RAID had written, and had booted up with RAID anyway! I then rebooted passing the raid=noautodetect option to the kernel, only to find that now LVM was seeing duplicate PVIDs and was effectively using both HDs anyway! At least I know that I had a really resilient setup :) Anyway, after finally turning off all RAID / balancing I was able to boot off one HD and see how that went, on the assumption that if it didn't fail it was the good HD. Of course it promptly failed, so I booted off the other HD, now confident that it was the good one. You can imagine my perplexity when it promptly failed even more spectacularly. Now I was left with the idea that the problem might have been one of the following (a rough sketch of the RAID / LVM commands involved follows the list):
* bad CPU
* bad motherboard (as mentioned previously there were some broken fans on the MB)
* bad RAM - after all, the problem was transient, which is the best indication of bad RAM, and it often happened when compiling.
* bad HD - both were old SCSI; in fact I have such an old SCSI HD in it that it's 4GB!
* bad compile options - I was using Gentoo hardened, which has some hardcore compile options, and then I had broken their guidelines and optimised further. Finally I had reverted off that and cross-compiled some binaries from my other AMD box.
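For anyone wanting to follow the RAID / LVM dance above, this is roughly the shape of it. Treat it as a sketch only: the device names are examples, and the exact commands depend on whether you drive things with mdadm or the old raidtools.

    cat /proc/mdstat                    # did the kernel auto-assemble the mirror again?
    mdadm --stop /dev/md0               # stop the array before touching its members
    mdadm --zero-superblock /dev/sda1   # wipe the persistent RAID superblocks so
    mdadm --zero-superblock /dev/sdb1   # autodetection has nothing to find next boot
                                        # (or just boot with raid=noautodetect, as I did)

    pvscan                              # with the mirror split, LVM sees two PVs with the
                                        # same UUID; tell it to ignore one disk with the
                                        # filter line in /etc/lvm/lvm.conf, e.g.
                                        #   filter = [ "r|/dev/sdb.*|", "a|.*|" ]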

The more I thought about it, the less sure I was that anything was fucking working as advertised, so I took the tried and tested option - I bought a new computer :) I briefly agonised about using this as an excuse to go the "full monty" and upgrade to 64-bit, but after thinking about it and talking to a friend I realised that it would be best to simply get an ultra-cheap, but still very powerful, "normal" upgrade. So as I write this the confirmation orders are coming in from komplett and ebuyers. The new terra will be an Athlon XP 2800, 1GB of RAM + other assorted goodies. I hope to have the parts and be building it by next weekend; now it's over to the British mail system. Sigh.

Saturday, November 6, 2004

Downtime

Problems with my main server are persisting. It is now seemingly having a USB-related kernel panic every other day. I am still trying to work out what I am going to do with my disciplina setup. I think for the interim I am going to move some services to a stable box based in Australia. I will keep you all informed.

Friday, November 5, 2004

Gaming

Being a hardcore gamer at times, I found this article very interesting. It's interesting reading the old game designers' opinions on things; there is such a different mindset between the old tabletop role-players and the latest and greatest "hit" from a modern computer gaming studio.

Thursday, November 4, 2004

Free Software

I read an article today. It was the usual war of words about free software and its sustainability. Groklaw has an editor who thinks this is the be-all and end-all of emphatic responses; in particular he is impressed by the clarity and style of writing that the refutation uses. I invite you to go and read both articles and form your own opinion, but I personally found the "attack" to be almost unreadable due to excessively vague and misplaced arguments, and the "rebuttal" to be slightly clearer but also to be repudiating points that I didn't even consider valid in the first place. An interesting read on this subject is O'Reilly's book about Richard Stallman; it is completely free (of course) and available online here.

Wednesday, November 3, 2004

Hardware woes

I have always thought that I have been relatively lucky when it comes to hardware. I have been using computers extensively for the last 20 years or so and I have never had a failure that has resulted in me losing all my data or having to completely replace a computer. Yet recently I am beginning to think that I am cursed, and that while I have not had a decisive failure, perhaps what I am experiencing is WORSE. Let me explain.

Currently I have 2 computers that I am using to host various things (I have others scattered around the world for redundancy, but essentially the active services are run on these 2 boxen), and just when I get something set up and others start to use / rely on it, something breaks. Now it is never anything major; an example might be my CPU. By "break" I mean that sometimes when doing some compiling I get an error from GCC indicating that the compilation failed due to a hardware error. At first I suspected my memory, but I have no reason to believe the RAM that has been fine for years should suddenly fail, whereas the CPU has been stressed quite severely over that period of time, and some of the fans on the mainboard have failed (no, not the CPU fans), so I thought that it was simply an issue of overheating. After some cooling tests this doesn't seem to be the case, as I can reproduce the GCC errors quite reliably even after I have left the machine off for some time or have cooled it, whereas I can still compile other things that are equally CPU-intensive with no problems. This doesn't sound like a major problem until you start to combine it with other things.
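As an aside, brute repetition is the easiest way to tell a flaky-hardware failure from a genuine compiler or source bug. A rough sketch of that kind of check, with the kernel tree standing in for any big, CPU-heavy compile (the error string to grep for is only an example of what GCC prints):

    cd /usr/src/linux
    for i in 1 2 3 4 5; do
        make clean > /dev/null
        make bzImage > build-$i.log 2>&1 || echo "run $i failed"
    done
    # a real bug dies at the same point every run; flaky CPU / RAM / cooling dies
    # in a different place each time, or only once the box has been working hard
    grep -il "internal compiler error" build-*.log

In my case the giveaway is the pattern: the same GCC error comes back reliably on one particular compile even when the box is cold, while other equally heavy compiles sail through.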
My main box connecting to the net is using a USB modem. I actually think that this is a good thing these days, as using a separate device, like say an ADSL router, means that you are relying on a black box for your security at the net termination point, which is a bad thing IMO. So while I am happy that I am plugging my net connection straight into my hardened server / router, it causes a problem because the modem is USB. Currently the drivers for my modem require that USB be built as modules in the kernel, which not only further increases security concerns but also means that there is a lot of loading and unloading of the driver when it initially connects. Now it turns out that there are a few additional problems: firstly, Unix USB code is in general shit (not specifically Linux), and secondly, the code for my USB modem is not much better. So now we have a situation where an unreliable USB stack and an unreliable driver are being loaded into my kernel, on top of other unreliable components like my CPU. The result is that the USB section of my kernel often panics and the box locks up ... and this is the same box that is running all my active services.

Now yes, I _could_ run all my active services on my other box, my desktop we will call it. The issue with that of course is that it is my desktop, which means that I use it as a testbed for all kinds of things (I am doing more and more Gentoo development work and end up running such alpha-quality stuff that I invariably kill it). Additionally, while it is increasingly rare these days, on the odd occasion I like to play some games, and that means a reboot into Windows and consequently downtime on all the services anyway. What this means is that I need to get a new box, migrate the important services off the older one onto my current desktop, and turn my current desktop into the new main server.

Of course this is complicated by the fact that I don't really NEED a new box, and I particularly don't want to buy one right now when the computer industry is going through a hardware change the likes of which we have not seen since the first Pentiums were produced. I am referring to the change from 32-bit to 64-bit on the home computer, the change from ATX to BTX form factors and the change from PCI to PCI-X. I have said a lot of times that I will only buy once all of these things are readily available and hopefully the price premium has dropped a little :) So in the interim I just keep trying to keep my old server working and try to ignore the jibes from my friends about "reliability" and me being an enterprise admin ...
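As a footnote, the module juggling I am complaining about looks roughly like this. The module names are from memory and differ between 2.4 and 2.6 kernels, so treat it as a sketch rather than a recipe:

    modprobe usbcore        # core USB support, built as a module as the driver demands
    modprobe uhci-hcd       # or ohci-hcd / ehci-hcd, whichever host controller the
                            # board actually has (usb-uhci / usb-ohci on 2.4 kernels)
    lsmod | grep -i usb     # see what actually got loaded
    # from here the eciadsl driver does its sync-and-connect dance, and it is all this
    # loading and unloading at connect time that seems to tickle the panics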