Geek stuff

AVL Tree in Python

I’ve been teaching “Applied Algorithms and Programming Techniques” and we just reached the topic of AVL Trees. Having taught half of the AVL tree concept, I decided to code it in python — my newest adventure. Bear in mind that I have never actually coded an AVL tree before and I’m not particularly comfortable with python. I thought it would be a good idea to experiment with both of them at the same time. So, I started up my python IDE (that’s Aptana Studio, btw) and started coding.

For the newbie programmer, the code itself may not be very useful since you can find better code online. The benefit is in being able to look at the process. You can take a look at the commits I made along the way over here on github. You can take a look at how I structured the code when I began and how I added bits and pieces. This abstraction should help in solving other problems as well. The final code (along with a rigorous unit test file) can be seen here: https://github.com/recluze/python-avl-tree

Creating UML Sequence Diagrams with TikZ in LaTeX

I’ve been working on my LaTeX skills for some time. The goal is to move towards an all-latex solution for writing research papers, slide sets and other documents. I’m almost there. An important goal was to be able to create all sorts of figures within LaTeX. (Well, originally, the goal was to use open source softwares to create them but it turns out that LaTeX is really good at this stuff.) The package that I’m using for graphics creation is TikZ. Here we’ll cover how we can create sequence diagrams using TikZ and a plugin package.

Here’s what we’re planning on creating.

Sequence Diagram using TikZ (click to enlarge)

First, you need to get the pgf-umlsd.sty file from over here and put it in a folder. Then, create your minimal working example (main document) using the following code.

\documentclass{article}
\usepackage{tikz}
\usetikzlibrary{arrows,shadows}
\usepackage{pgf-umlsd}

\begin{document}

\begin{sequencediagram}
\newthread[white]{u}{User}
\newinst[3]{b}{Browser}
\newinst[3]{t}{TPM}
\newinst[3]{p}{TTP}

\begin{call}{u}{Init()}{b}{}
\end{call}

\begin{call}{u}{AIKAuthSecret}{b}{}

  \mess{b}{verifyAIKAuthSecret}{t}

  \begin{call}{b}{get AIK$_{pub}$}{t}{AIK$_{pub}$}
  \end{call}

\end{call}
  \begin{sdblock}{Loop}{}

    \begin{call}{u}{Do Something}{p}{AIK$_{pub}$}
    \end{call}
  \end{sdblock}
\end{sequencediagram}

\end{document}

As you can see, the pgf-umlsd package is really awesome. You first create a new thread using the syntax \newthread[color]{variable}{ClassName}. Then, you create instances using \newinst[distance]{variable}{InstanceName}. Afterwards, use call environment to specify calls. All you need to do is specify the caller, the message label, recipient and return label. Messages are similar and can be done through the \mess command. You can insert blocks using the sdblock environment. All it needs is a label and a description. A block will automatically surround everything within this environment. Oh, and calls can be nested.

If you’re like me and don’t like your object names underlined, you can pass the [underline=false] option to the pgf-umlsd package.

p.s. Your output may be a little bit different from mine because I modified the package file to suit my personal liking — just a bit though.

LaTeX Screencasts

I’ve started putting together a couple of screencasts for those who want to start working with LaTeX. These are aimed at the extreme newbie who wants to learn the basics and get up to speed with the typesetting tool. I’ll be updating this post as I put more videos online inshallah. For now, see the videos below or on Youtube. For best results, view in HD at full screen.

Part I: Introduction

Part II: Creating your first document
Download the files mentioned in the video from here: latex-seminar-files-demo-1.zip

Part III: Bibliographies, Class Files for Conference Styles
Download the files mentioned in the video from here: latex-seminar-files-demo-2.zip

Site Re-design

New Site Design

Last time I did a custom re-design for my site was way back during my blogspot time. That was in 2006 — five years have passed but I still like the design. When I moved to wordpress.com, I didn’t have a way of creating my own design so I stuck with the best design I could find. I moved to my own host here at CSRDU last year but didn’t really feel the need to create a custom design. The result, even with the great theming mechanism provided by WordPress, I never wrote a custom theme for my site. I always stuck with existing freely-available themes that always left me wanting more in one department or another. Either the typography wasn’t up to par or I didn’t like the comments layout. So, I always had to settle with whatever I could find.

That changed a couple of days ago when I came across a typography post on some blog which inspired me to begin my own wordpress theme. I had one clear goal in mind — improve readability. People come to my site mostly to read the tutorials. It’s not fair if I give text secondary importance focusing on the layout. So, I started customizing the sandbox wordpress theme. It has the cleanest markup and I was able to make all the changes simply through a custom CSS. I went with a fairly large serif font (Georgia) for the content with a sans-serif (Open Sans) font coming from Google Webfonts for the post titles. I also have a slight text shadow effect but it wont’ be visible if you’re using IE. There’s only around 5 images in the whole theme plus two fonts. So, the overall result is a fairly lean page with clear fonts and layout.

As always, all comments and criticism is most welcome.

Google Has Messed Up Social Once Again

I know I might be in the minority right now but that’s how I feel. It seems Google has learned little from Wave and Buzz. Here’s what I think has gone wrong this time.

First, Google engineers have probably never heard of the phrase, “less is more”. They tried doing everything with Wave — everyone knows how that turned out. They’re doing the same thing with Google+ (or Google Plus). It’s twitter, friendfeed, skype, facebook and slashdot all rolled into one. The problem is, I don’t know which one I’m using when I navigate to the G+ interface.

I know I can divide people into circles and keep them separate but I don’t know if I can keep track of it all. I have separate ‘circles’ for friends and ‘professional connections’. Most often, though, I want to share a thought with both of them so I just post that to the  ’public’ circle. My friends, goofy as they are, start commenting on the post and it quickly turns into a dorm room crap fest. That’s not the ‘professional’ image I want to project — that was the whole point of circles. The solution, post the same thing twice, once to public and again to friends. But then, why don’t I just go over to twitter and post there?

That, I think is the core of the problem. Why would anyone want to use Google+ — after the initial awe of the cool interface for dropping your friends in a circle subsides? For sharing news — I already have a neat little twitter account for that. It’s streamlined and it does what it’s supposed to do. When I’m there, I know what I’m there for. I don’t get distracted by comments from my goofy friends. Well, how about keep tabs on my friends? I don’t use facebook myself but last time I checked a lot of people were already using that social network. Just as people haven’t jumped the Yahoo! mail ship despite the immense impotence of Yahoo!, I don’t see why they’d move everything over from Facebook over to Google+. Not everyone likes to play with new and shiny geek toys.

And that brings me to the second point: Google engineers just can’t shake the geek within them. They think everything will be adopted if it’s similar enough to Gmail. They tried doing this with Wave. They did the same thing with Buzz, integrating it too tightly with Gmail and that was a fiasco. Now, they’re doing this with Google+. It’s all about how cool the technology is. They’re going to release the API soon.  That’s all great but what about the social aspects? I don’t see any incentives for moving away from my existing social networks — except maybe Buzz. So, I don’t think Google+ is a facebook killer or a twitter killer. It might be a  Buzz killer but that too is a maybe.

A Basic Naive Bayes classifier in Matlab

This is the second in my series of implementing low-level machine learning algorithms in Matlab. We first did linear regression with gradient descent and now we’re working with the more popular naive bayes classifier. As is evident from the name, NB it is a classifier i.e. it sorts data points into classes based on some features. We’ll be writing code for NB using low-level matlab (meaning we won’t use matlab’s implementation of NB). Here’s the example we’ve taken (with a bit of modification) from here.

Consider the following vector:

(likes shortbread, likes lager, eats porridge, watched England play football, nationality)T

A vector x = (1, 0, 1, 0, 1)^T would describe that a person likes shortbread, does not like lager, eats porridge, has not watched England play football and is a national of Scottland. The final point is the class that we want to predict and takes two values: 1 for Scottish, 0 for English.

Here’s the data we’re given:


X = [ 0 0 1 1 0 ;
1 0 1 0 0 ;
1 1 0 1 0 ;
1 1 0 0 0 ;
0 1 0 1 0 ;
0 0 1 0 0 ;
1 0 1 1 1 ;
1 1 0 1 1 ;
1 1 1 0 1 ;
1 1 1 0 1 ;
1 1 1 1 1 ;
1 0 1 0 1 ;
1 0 0 0 1 ];

Notice that usually when we represent data, we write features in columns, instances in rows. If this is the case, we need to get the data in proper orientation: features in rows, instances in columns. That’s the convention. Also, we need to separate the class from the feature set:

Y = X(:,5);
X = X(:,1:4)'; % X in proper format now. 

Alright. Now, that we have the data, let’s hear some theory. As always, this isn’t a tutorial on statistics. Go read about the theory somewhere else. This is just a refresher:

In order to predict the class from a feature set, we need to find out the probability of Y given X (where

X = ( x_1, x_2, \ldots x_n )

with n being the number of features. We denote the number of instances given to us as m. In our example, n = 4, m = 13. The probability of Y given X is:

P(Y=1|X) = P(X|Y=1) * P(Y=1) / P(X)

Which is called the Bayes rule. Now, we make the NB assumption: All features in the feature set are independant of each other! Strong assumption but usually works. Given this assumption, we need to find P(X|Y=1), P(Y) and P(X).

(The weird braces notation that follows is the indicator notation. 1\{ v \} means use 1 only if condition v holds, 0 otherwise.)

P(X) = P(X|Y=1) + P(X|Y=0) P(X|Y=1) = \prod_j{P(x_i|Y=1)}

To find P(X|Y=1), you just have to find P(x_i|Y=1) for all features and multiply them together. This is where the assumption comes in. You need the assumption of independence here for this.

P(x_i|Y=1) = \sum_j{1\{x_i^j = 1, y^j = 1\}} / \sum_j{1\{y^j = 1\}}

This equation basically means count the number of instances for which both x_i and Y are 1 and divide by the count of Y being 1. That’s the probability of x_i appearing with Y. Fairly straight forward if you think about it.

P(Y=1) = \sum_j{1\{y^j = 1 \}} / \sum_j{1\{y^j = 1, y^j = 0 \}}

Same as above. Count the ratio of Y=1 with the total number of Ys. Notice that we need to calculate all these for both Y=0 and Y=1 because we need both in the first equation. Let’s begin from the bottom up. For all of below, consider E as 0 and S as 1 since we consider being Scottish as being in class 1 (positive example).

P(Y):

pS = sum (Y)/size(Y,1);     % all rows with Y = 1 
pE = sum(1 - Y)/size(Y,1);  % all rows with Y = 0

P(x_i|Y):

phiS = X * Y / sum(Y);  % all instances for which attrib phi(i) and Y are both 1
              % meaning all Scotts with attribute phi(i)  = 1 
phiE = X * (1-Y) / sum(1-Y) ;  % all instances for which attrib phi(i) = 1 and Y =0
              % meaning all English with attribute phi(i) = 1 

PhiS and PhiE are vectors that store the probabilities for all attributes. Now that we have the probabilities, we’re ready to make a prediction. Let’s get a test datapoint:

x=[1 0 1 0]';  % test point 

And calculate the probabilities P(X|Y=1) and P(X|Y=0)

pxS = prod(phiS.^x.*(1-phiS).^(1-x));
pxE = prod(phiE.^x.*(1-phiE).^(1-x));

And finally, the probabilities of P(Y=1|X) and P(Y=0|X)

pxSF = (pxS * pS ) / (pxS + pxE)
pxEF = (pxE * pS ) / (pxS + pxE)

They should add upto 1 since there are only two classes. Now you can define a threshold for deciding whether the class should be considered 1 or 0 based on these probabilities. In this case, we can consider this test point to belong to class 1 since the probability pxSF > 0.5.

And there you have it!

Helpful Tips for Newbie System Admins

Before you start reading this tutorial, let me remind you once again who the intended audience of this post is — newbie system administrators. If you’re an experienced admin and are going to laugh at my naivete for writing such basic stuff, please go away — or provide some more ‘advanced’ tips in the comments below so that I know better for the future.

Anyway, let’s begin. If you’re like most newbie system admins, you run a windows system on your home PC / work laptop and connect to the servers you’re managing using putty. Everyone uses putty, you say. Well, yes but there are alternatives. One of the things I hate is to have to copy/paste all the usernames and passwords into that putty session before I can login. So, let’s get rid of that first.

Read more…

Installing Jira with MySQL

Jira is an extremely well-known issue tracking system and is used widely for project management in a wide array of fields. It has quite a detailed documentation but it’s in the form of a wiki and as we all know, wikis are the worst way of creating software documentation. Anyway, in order to install Jira with MySQL, you will have to click and click and click. This tutorial aims to ease this issue by providing step-by-step instructions on how to install jira and enable it to connect with MySQL for storage. So let’s begin.

Note: These instructions are for the standalone version of Jira. You can use them quite easily for the WAR/EAR version but if you’re going on that route, you probably don’t need this article.

Ok, first download and install Java (I usually go with JDK) — version 6 is preferred. You can get the .bin file for Linux from java.sun.com (I refuse to call it Oracle Java). Make it executable and run it. JDK will be extracted in your current directory. Move it to /usr/share. Then set the JAVA_HOME variable.

 
JAVA_HOME=/usr/share/jdk1.6.0_23
export JAVA_HOME

You might want to set this in your ~/.bash_profile file.

Next, create a user with which we will start jira — for security reasons.

 
sudo /usr/sbin/useradd --create-home --home-dir /usr/local/jira --shell /bin/bash jira

Download and extract the jira standalone .tar.gz file to /usr/local/jira and change ownership of all the files to the jira user:

 
chown jira:jira /usr/local/jira -R

Open port 8080 which is used by jira (by default) — edit the file /etc/sysconfig/iptables and enter the following rule:

 
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT

Set the required jira.home property in the file /usr/local/jira/atlassian-jira/WEB-INF/classes/jira-application.properties

 
mkdir /usr/local/jira/jirahome
vi /usr/local/jira/atlassian-jira/WEB-INF/classes/jira-application.properties
# set the variable jira.home to /usr/local/jira/jirahome

Restart the firewall and start jira using the following command:

 
sudo -u jira nohup /usr/local/jira/bin/catalina.sh run & > jira-nohup.log

The use of sudo will run jira as the user jira and nohup will ensure that jira won’t stop running as soon as you close the shell. The output produced by the command will be saved to jira-nohup.log

Open the browser on location http://your.ip.add.ress:8080/ and ensure that you can see the jira setup page. Don’t bother proceeding with the setup because as soon as we connect jira to MySQL, this information will be lost. Let’s do that now.

Connecting with MySQL

Begin by stopping jira, installing mysql server, setting it to always start on system startup and starting the server.

 
/usr/local/jira/bin/shutdown.sh 
yum install mysql mysql-server 
chkconfig mysql on 
service mysqld start 

Then start mysql, create a database and user for jira and assign all rights to the new user on the new database.

 
mysql 
create database jiradb character set utf8;
grant all privileges on jiradb.* to jira@localhost identified by '[your new password]';
\q 

Now, edit the conf/server.xml in the jira directory and change the data source as follows:

 
<Resource name="jdbc/JiraDS" auth="Container" type="javax.sql.DataSource"
            username="jira"
            password="[jira user password]"
            driverClassName="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/jiradb?useUnicode=true&amp;characterEncoding=UTF8"
            maxActive="20"
            validationQuery="select 1"/>

Note that the ampersand code in the connection string is not a formatting problem. It really has to say amp with an ampersand and a semicolon at either end.

Finally, edit the atlassian-jira/WEB-INF/classes/entityengine.xml file to set the final attribute:

 
<datasource name="defaultDS" field-type-name="mysql"
      helper-class="org.ofbiz.core.entity.GenericHelperDAO"
      check-on-start="true"
      use-foreign-keys="false" ...

Delete the schema-name="PUBLIC" attribute immediately after the changed line to ensure that the database is populated properly.

Start jira once again and now you can enter the setup information. Open MySQL and ensure that jira is, in fact, using this engine.

Creating Multiple Volumes on Amazon EC2

For those of you who have used Amazon AWS (EC2 specifically) for more than just testing would know that the / partition in an AMI does not go beyond 10GB. So, if you need more space you need to mount more volumes. This is a beginner’s guide to do just that.

First off, create an AMI with EBS storage type and take a note of the zone in which that AMI sits. You can find this through the details pane in the ‘instances’ section of your AWS management console. Then, go over to the ‘volumes’ section and create a new volume.

Here, you need to make sure that you create the volume in the same zone as your AMI or you won’t be able to use the volume with it. Specify the capacity of the volume. That’s all that’s required at this stage. Click create and wait until the new volume becomes available. Then right-click on the volume and ‘Attach Volume’. Make sure you select the right instance and then specify the new device name. You can use something like ‘/dev/sdd’. Just use fdisk -l in your running instance to make sure it’s not already in use.

After that, login to your AMI and fdisk -l again to make sure that the new volume is available. Now, you need to create a partition and format it for linux to use. Then, mount it someplace. Here we’re going to move the whole /var to this new partition so that our logs etc can have more free space to grow.

mkfs.ext3 /dev/sdd 
mkdir /mnt/var
mount /dev/sdd /mnt/var 
cp -R /var/* /mnt/var
touch /mnt/var/new-vol

The last line is there to verify after boot that you are indeed using the newly volume. You can also use df -h after reboot to test this but I just feel better this way.

Finally, open up the /etc/fstab file and enter the new device /dev/sdd and the mount point /var.

/dev/sdd    /var    ext3     defaults    0    0

And reboot.

Bash Script to Find and Replace in a Set of Files

Ok, so this might sound childish to the more experienced admins and programmers but I’ve always found the need to search and replace a string in multiple files. For example, if I have to work with an inexperienced programmer’s code, I might have to change the name of the database in a couple of dozen places. If you are in such a place, you might want to try the following script:

for f in submit_*; 
  do sed "s/old_db_name/new_db_name/" < $f > a_$f ; 
  mv a_$f $f ; 
done

The first line finds all files that match the pattern submit_*. The loop first calls sed, makes the replacement and outputs the file to a_$f. Finally, it renames the a_$f file to $f so that we get the original filename. So, there you go. You can make all sorts of complicated finds and replaces through regular expressions and unleash the power of sed on this script. Chao.