Thursday, June 25, 2009

Issue Sensor Redo

Progress of issue sensor is behind expectation. After investigated SensorBase's mechanism, I started to redo the issue sensor. At first, I just want to modify the original code somehow to make it a singleton. However, I kind of mess it up and finally almost redo it from scratch. It is almost finish and just need some unit tests before commit the code.

The sensor is more difficult than I image, because it now actually do some of the analysis job during sensor data collection. In order to preserve as much information as possible, the sensor take not only RSS feed, but also information from issue tracking system directly. The Google Issue Tracking System provides a RESTful link to the issue summary table in CSV format. This gives information include almost as much everything as extracting from individual issue html page. I only keep the field interesting, they are id, type, status, priority, milestone and owner. Here is how the sensor do its job.

Firstly, it retrieve all issue sensor data from sensorbase (It would not be too much because only on piece of data per issue only.), and also get the issue summary table. Then match each issue from the table to a sensordata. If a issue does not have a sensordata for it, it will create a sensordata for it. Secondly, it check the RSS feed, and add update information to the sensor data. The field names, such as id or type, are used as property key, and the property value consists of the field value as well as the timestamp, collected with "--". Finally, new and modified sensor data will be sent to sensorbase.

Monday, June 15, 2009

Cannot put all data search in database

Issue Sensor Data

The issue sensor data now is decided to use single instance per issue design. The owner of the sensordata, which is the major problem of the approach, is set by the issue sensor, before we can come up something better to decide the data owner. We just want to move on to the core part instead of sitting there thinking about the data owner. That is the easiest way, but require most attention of users to ensure things work good, so sufficient documentation may compensate somehow, hopefully.

New API to SensorBase

When coding the issue sensor as single instance per issue, it is important to get the data of the given issue efficiently. While the issue is identified by its id (number id usually), and that id is stored as a property in the sensordata, it will be nice to extend SensorBase's API to return sensordata which contains a given property with given value. So I start to try to do this. Unfortunely, I found it is not easy to accomplish, and should not be include to SensorBase's functionality.

Firstly I study about the mechanism of its API. I thought the API will look like http://hostname:port/sensorbase/sensordata/user/timestamp/?sdt=sensordatatype&propertyname=propertyvalue. It is actually possible to use dynamic property name, by using "{propertyname}={propertyvalue}" in route definition. Then in the resource, just get the property name and property value as two Strings, just the same way as usual property.

Secondly I try to extend the database interface to allow query sensordata with given property entry, when I found the current tables in database does not support this query. In the sensordata table, the values are (Owner, Tstamp, Sdt, Runtime TIMESTAMP, Tool, Resource, XmlSensorData, XmlSensorDataRef, LastMod), and the property list in stored in the XmlSensorData, which is a XML representation of the sensordata. In order to get the data, I need to query the XmlSensorData. However, in order to get a set of XmlSensorData, I need to first get their XmlSensorDataRef, then use that to query the XmlSensorData(It is not the only way to doing this, but it is the convention of SensorBase, probably to make it easier to separate data instances from query return.). When I get the XML data, I parse the XML and extract the property list, then do the comparison. Then this API query will return the XmlSensorDataRef. The user actually need to retrieve the data with the data reference again from SensorBase. As you can see, there are duplicate query from reference to data instance. That's because the SensorBase is trying to do the things that suppose to be done by data user. The most reasonable way is, the user get the data from the data references, then do the comparison to get the data he want by himself.

Therefore, the issue sensor will just get all the issue data(with the sensordatatype = issue), then compare the issue id to get the one it need to add data in.

Monday, June 8, 2009

Stuck in Issue Sensordata Design

Issue sensordata is so different from other sensordata that it does not conceptually belong to a particular user. Instead, it belongs to a software project, but this project is not the same concept of project in Hackystat system. In Hackystat, a project is just a definition to group up users and data to represent a actual project, however, there does not necessary exist an associated actual project. The problem is, all Hackystat sensordata belongs to a user. We have to decide a user to process the data, and that user has to be sure to stay in all the projects which the issue may belong to for the whole life time of the projects. This is like a administrator in the project. But this administrator need to be managed by users, not the system. There is not an administrator defined in Hackystat users, and we dont want to make this exception for just one kind of sensordata(it is not necessary for others).

The first design is to store changes of an issue, starts from the creation of the issue, followed by its updates. Then the issue sensordata is assigned to the owner of the update/creation time. The major resource will be the RSS of the issue updates. The good of this is that it keep all the update in the help of RSS, and the owner of the data is reasonable. But the shortage is that the RSS provides limit information. In the creation thread, it only include the comment. In the update threads, it only include the state/labels that being changed. The current unchanged state is unknown from RSS and has to be found out from the issue tracking system(via http in most case). Another problem is when analyse the data, all data started from the project start time have to be gather together to get the view of the given time. It might be a lot computation if there is lots of updates.

The second design is to make a single sensordata associated to a single issue. The updates will be store in the properties list of that sensordata, from the same data resource: RSS. In the creation of an issue data, the current state/labels will be extract from issue tracking system, then it is easier to keep track of future changes. Also, it is easier to analyze, only need to go through that single data instance to figure out the state of a given time. However, the problem of this design is the owner of the sensordata, because it has to know the owner to get the data, and in project level analysis, that data owner has to be in the project which the issue should belong to. There is no a reasonable way to answer this question without making some hack or modifying/adding current system definition. It is possible to let user define who the data belongs to, but it is unsave because it require not only the user know excatly what he is doing, but also all sensors collecting data for the same project need to be configure excatly the same. Otherwise, there may exist mutilple copies of the data instance, which is a great fault of the single data assumption.