
Key: CIB-1406
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: jason
Reporter: Tom Jacob
Votes: 0
Watchers: 0
Pulse

Pulse build errors out during Junit XML artifacts processing

Created: 17/Mar/08 01:32 PM   Updated: 07/Apr/08 01:46 PM
Component/s: Build core, Agent
Affects Version/s: 1.2.45
Fix Version/s: 1.2.49

File Attachments: 1. TESTS-TestSuites.zip (Zip Archive, 1.03 Mb)

Environment:
Linux (Mandriva)
2.6.20.4-3mdksmp #1 SMP Tue Apr 3 16:51:23 BST 2007 x86_64
Intel(R) Xeon(TM) CPU 3.00GHz unknown GNU/Linux


Description:
We are currently evaluating Pulse to replace our existing build tool. Our set-up includes a Pulse master on a buildbox performing builds using a remote agent. The project type I'm using is the ANT type as we already have stable ANT targets that can handle all of build and test. All's been going pretty well until I got stuck with a problem which prevents Pulse from properly reporting test results and capturing test artifacts.

When this happens, the build is marked as failed with the status: "Unexpected error: This node has no children"

The ANT build and test targets run and produce the expected artifacts including the TESTS-TestSuites.xml junit xml test report file and corresponding junit html reports, all of which are available on the filesystem in the expected place. I have verified the validity of the TESTS-TestSuites.xml using xmllint --validate and can confirm that there are no validation issues. However, in spite of the above error status, pulse does show the number of tests run as "1364". But this does not match the actual number of testcases in the TESTS-TestSuites.xml file, which is 1550.

In addition, the tool also did not collect the Junit HTML reports from the filesystem, presumably due to some cascading effect of the error in processing the XML report.

I have run the build many times over and the problem persists.

Highly appreciate your response on the matter as we are pushing for the shift to your tool, but this problem is proving to be a stumbling block.

Please find listed below the following:
1. Tail of the command output from the build
2. Pulse file for the project
3. Exception stack-trace from the pulse log (taken from the machine running the agent)

I could provide you the TESTS-TestSuites.xml file if that'd help your investigation of the problem, but it is rather huge (13 MB), so, I'd need an upload location I'm afraid.


tail of the command output log from Pulse shows:
--------------------------
     [exec] report:
     [exec] [echo] reporting: src=/localhome/tjacob/pulse_data/recipes/688144/base/java/ant/artifacts/test_results, dest=/localhome/tjacob/pulse_data/recipes/688144/base/java/ant/junit_html_report
     [exec] [junitreport] Transform time: 13036ms

     [exec] BUILD SUCCESSFUL
     [exec] Total time: 25 seconds
     [exec] Ant called 8 times, succeeded 5 times

 BUILD SUCCESSFUL
Total time: 3,898 minutes 40 seconds
============================[ command output above ]============================
3/17/08 11:47:37 AM GMT: Command 'build' completed with status error
3/17/08 11:47:38 AM GMT: Storing test results...
3/17/08 11:47:45 AM GMT: Test results stored.
3/17/08 11:47:45 AM GMT: Compressing recipe artifacts...
3/17/08 11:50:01 AM GMT: Artifacts compressed.
3/17/08 11:50:01 AM GMT: Recipe '[default]' completed with status error
3/17/08 11:50:01 AM GMT: Collecting recipe artifacts...
3/17/08 11:51:52 AM GMT: Collection complete
-------------------------------------------------------------------------

Pulse file for the Project:

========================
<?xml version="1.0"?>
<project defaultRecipe="ant build">

    <junit.pp name="junit"/>

    <recipe name="ant build">
        <command name="build">
            <!-- pull in the ant resource if configured. -->
            <resource name="ant" required="false"/>
            <ant
                build-file="scheduled.build.xml"
                targets="ci_build_and_version run_release_tests"
                args="-DartifactsDir=artifacts -DjunitHtmlReportDir=junit_html_report -Dbranch=trunk"
                working-dir="${base.dir}/java/ant">
            </ant>

            <dir-artifact name="deliverables" fail-if-not-present="false"
                base="java/ant/artifacts/deliverables">
            </dir-artifact>
            <artifact name="junit xml" file="java/ant/artifacts/test_results/TESTS-TestSuites.xml" fail-if-not-present="false">
                <process processor="${junit}"/>
            </artifact>
            <dir-artifact name="junit html" fail-if-not-present="false"
                base="java/ant/junit_html_report">
            </dir-artifact>

        </command>
    </recipe>
</project>
=====================================

The exception stack-trace from the pulse log file from the agent installation:

SEVERE: This node has no children
java.lang.IndexOutOfBoundsException: This node has no children
        at nu.xom.ParentNode.getChild(Unknown Source)
        at com.zutubi.pulse.core.JUnitReportPostProcessor.getMessage(JUnitReportPostProcessor.java:130)
        at com.zutubi.pulse.core.JUnitReportPostProcessor.processCase(JUnitReportPostProcessor.java:124)
        at com.zutubi.pulse.core.JUnitReportPostProcessor.processSuite(JUnitReportPostProcessor.java:83)
        at com.zutubi.pulse.core.JUnitReportPostProcessor.processDocument(JUnitReportPostProcessor.java:50)
        at com.zutubi.pulse.core.XMLReportPostProcessor.internalProcess(XMLReportPostProcessor.java:33)
        at com.zutubi.pulse.core.TestReportPostProcessor.process(TestReportPostProcessor.java:59)
        at com.zutubi.pulse.core.LocalArtifact.captureFile(LocalArtifact.java:151)
        at com.zutubi.pulse.core.FileArtifact.scanAndCaptureFiles(FileArtifact.java:90)
        at com.zutubi.pulse.core.FileArtifact.capture(FileArtifact.java:75)
        at com.zutubi.pulse.core.CommandGroup.execute(CommandGroup.java:69)
        at com.zutubi.pulse.core.RecipeProcessor.executeCommand(RecipeProcessor.java:268)
        at com.zutubi.pulse.core.RecipeProcessor.build(RecipeProcessor.java:204)
        at com.zutubi.pulse.core.RecipeProcessor.build(RecipeProcessor.java:100)
        at com.zutubi.pulse.slave.SlaveRecipeProcessor.processRecipe(SlaveRecipeProcessor.java:73)
        at com.zutubi.pulse.slave.command.RecipeCommand.run(RecipeCommand.java:30)
        at com.zutubi.pulse.slave.ErrorHandlingRunnable.run(ErrorHandlingRunnable.java:35)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)

==================================


Comments:
Tom Jacob - 17/Mar/08 02:13 PM
Attaching the junit xml report file that pulse failed to process

Tom Jacob - 17/Mar/08 02:15 PM
--
Attached the Junit XML report file that pulse failed to process

jason - 17/Mar/08 02:15 PM
Hi Tom,

Thanks for the very detailed report; this made the problem very easy to track down. Essentially, the JUnit report post-processor is assuming the existence of a detail message for failed test cases. It actually tries to check for the existence of the message first, but in an incorrect way. Thus if your report has a test failure or error XML element that does not have text content, the post-processor fails. I have reproduced this with a test case and can fix it quite easily. We can make a new build available for you this week so you are not held up any longer. Sorry for the trouble!
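
The kind of existence check described above can be sketched as follows. This is a minimal illustration only, not the Pulse source: it uses the JDK's built-in DOM parser rather than XOM, and the class and method names (FailureMessageSketch, detailOf, firstFailure) are hypothetical. The point is that the detail text is read without indexing into the element's children, so a <failure> with no nested text yields an empty string instead of an exception.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not Pulse's JUnitReportPostProcessor.
class FailureMessageSketch {

    // Returns the nested detail text of a <failure> or <error> element,
    // falling back to its "message" attribute, then to an empty string.
    static String detailOf(Element failure) {
        // getTextContent() returns "" for an element with no children,
        // so no child index is ever dereferenced.
        String text = failure.getTextContent();
        if (text != null && !text.trim().isEmpty()) {
            return text.trim();
        }
        // getAttribute() returns "" when the attribute is absent.
        return failure.getAttribute("message");
    }

    // Parses a small XML snippet and returns its first <failure> element.
    static Element firstFailure(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return (Element) doc.getElementsByTagName("failure").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    public static void main(String[] args) {
        String empty = "<testcase name=\"t1\"><failure type=\"SomeError\"/></testcase>";
        String full = "<testcase name=\"t2\"><failure message=\"boom\">stack trace here</failure></testcase>";
        System.out.println("[" + detailOf(firstFailure(empty)) + "]"); // prints "[]"
        System.out.println("[" + detailOf(firstFailure(full)) + "]");  // prints "[stack trace here]"
    }
}
```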

jason - 17/Mar/08 02:21 PM
Hi Tom,

I forgot to confirm that the HTML report not being captured is almost certainly due to the earlier problem. I also took a look at the XML report you just attached (thanks) and found there were indeed some <failure>s with no nested text, which would trigger this problem. It seems that JUnit itself always produces detail messages, which is why we have not uncovered this bug until now.
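
For anyone wanting to check their own report for the same pattern, a rough diagnostic might look like the sketch below. It is an assumption-laden illustration: the class name ReportScanSketch and its methods are made up for this example, it relies on the usual <testcase>, <failure> and <error> element names from Ant's JUnit XML output, and it uses the JDK DOM parser rather than anything from Pulse. It counts the test cases in a report and the failure or error elements that carry no nested text.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic for illustration; not part of Pulse.
class ReportScanSketch {

    static Document parse(String xml) {
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse report", e);
        }
    }

    // Total number of <testcase> elements anywhere in the report.
    static int countTestCases(Document doc) {
        return doc.getElementsByTagName("testcase").getLength();
    }

    // Number of <failure>/<error> elements with no (non-blank) nested text.
    static int countEmptyDetails(Document doc) {
        int empty = 0;
        for (String tag : new String[] {"failure", "error"}) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getTextContent().trim().isEmpty()) {
                    empty++;
                }
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the attached report.
        String sample =
              "<testsuites><testsuite name=\"s\">"
            + "<testcase name=\"a\"/>"
            + "<testcase name=\"b\"><failure type=\"X\"/></testcase>"
            + "<testcase name=\"c\"><error type=\"Y\">trace</error></testcase>"
            + "</testsuite></testsuites>";
        Document doc = parse(sample);
        System.out.println(countTestCases(doc) + " test cases, "
                + countEmptyDetails(doc) + " without detail text");
    }
}
```

Run against the sample above, this reports 3 test cases and 1 element without detail text.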

jason - 17/Mar/08 05:59 PM
Fixed in change 4171.

Tom Jacob - 18/Mar/08 09:54 AM
Hi Jason,

Thank you very much for looking at this issue on priority. Look forward to the next release build with fix. Am I right in assuming that I'll get the version 1.2.49 for download from http://zutubi.com/products/pulse/downloads/ sometime this week or is there a separate location for downloading nightly builds?

Oh another question I have is -
Is there any way I can just use the new core-1.2.x jar file and not re-install/upgrade the whole product?

Thanks again for the great support.

jason - 18/Mar/08 10:15 AM
Hi Tom,

No problem! The 1.2.49 release will indeed be available from our website downloads page. We like to do frequent builds (and since it is all automated with our own Pulse install it is easy enough :) ), every one of which is a full release. We do not currently do nightlies - although it is something we will probably add one day as it is again simple to schedule in our own Pulse install.

You might be able to drop in the new core jar, but there will be a few other changes, some of which span other jars, so it is not guaranteed to work. Actually, upgrading the whole install is extremely simple: you just need to back up, then unpack and start the new version and point it at your existing data. If there are any schema upgrades (in this case I don't believe there will be), Pulse will show you an automatic schema update page and after running through that things will be going again.

jason - 20/Mar/08 06:08 PM
Hi Tom,

Just FYI: Pulse 1.2.49 is now available from http://www.zutubi.com/products/pulse/downloads/. It includes a fix for this issue.

Tom Jacob - 03/Apr/08 12:42 PM
hi Jason,

Thanks. I did get the update installed and the system is now able to parse the JUnit XML test results properly. But the updated software, when pointed to the old data directory, did not show the existing projects as I expected. So, after a few attempts I had to manually redo the whole lot.

Aside from that, we have had a new issue, namely, getting build failures due to "connection to agent lost" error. I reckon this happened about 3 to 4 times since I've installed. I have looked into the bug list and find that this has been raised before and is explained as an artifact of faulty/unavailable network. All of our other network apps are working OK and we couldn't find anything of interest happening on the network during these failures. So, this is a cause for concern for us, and we've decided to trial run Pulse in parallel with our existing tool for a month and collect statistics to properly analyze the risk due to this vulnerability (which may be due to the network flakiness) before we make a decision on replacing our existing tool with Pulse.
On account of this exercise I'd like to request an extension of our trial period (which ends on the 6th) by a month, if that is possible. I'm not sure if you are the right person to contact in this regard. Apologies if you are not; could you in that case please direct me to someone who can speak on commercial aspects?

jason - 03/Apr/08 01:36 PM
Hi Tom,

Good to hear that this fix works for you. However, it is not so great to hear of your upgrade troubles (particularly after I said it was so simple)! It seems something unusual happened in your environment, as this is certainly not expected. May I ask some questions to try and track this down? Specifically:

1) Did you do the upgrade as the same user that was running Pulse?
2) When you installed the new version, what setup steps did you see? Steps include:
  - prompt for data directory
  - license page
  - admin user page
  - server settings page
3) What does the original data directory look like? Has it been replaced with current data?

Re: connection lost issues, this has been caused in the past by network reliability issues and also very high load on agents during builds. An agent is determined to drop offline if it fails to respond to a master ping within a certain timeout. There are parameters you can tune: the ping frequency and timeout. See:

http://confluence.zutubi.com/display/pulse0102/System+Properties

for details. If problems persist then we can help you with extra debugging.

Lastly, I will email you directly regarding the license.

Tom Jacob - 07/Apr/08 10:54 AM
Hi Jason,

Thanks again for the extended license.

On the issue of lost data, answers to your queries are:

1. yes, i did do the install as the same OS user as the one running pulse

2. what I did to upgrade was:
     backed up the old pulse data directory and mv-ed the pulse-agent* directory
     extracted the new pulse agent tar
     started pulse; and it started up with ~<userhome>/.pulse as the data directory. So, I stopped the server and restarted with -d pointing to the original pulse_data directory. This time it did refer to the new data directory but said "empty database, initializing.." or something similar. I logged in and found no projects. btw, it did not prompt for data directory (presumably as I had set it from command line?) and went straight to the license prompt and onwards as you have described.

3. Sorry, but I have since deleted the copy of the old pulse_data folder. As to the new one, yes it has only new config in it.

Re: the connection lost issue, which is more important from our perspective -

This issue did occur over the weekend again. This time connection to the agent seems to have been lost during the checkout from subversion.
The server log does show a ping timeout at the same time as the connection loss is reported on the build log, so I'm refraining from opening up another issue as you seem to think this is not a pulse issue.

So, I've decided to modify the ping.timeout and interval values as you have suggested. But I was wondering if you had any recommendations on how the interval and timeout params should be used in this case. I'm contemplating increasing the timeout and decreasing the interval so that there are multiple pings attempted within the timeout period, which I hope will give it a better chance at getting a reply back. I'm assuming here that the ping and its reply is a custom heartbeat check that you have built into the masters/agents and does not have a big network footprint.

Your comments/recommendations on this would be much appreciated.

Tom Jacob - 07/Apr/08 10:58 AM
Hi Jason,

In my above comment; please ignore references I've made to pulse-agent in the section where I've tried to detail the upgrade process I followed. I meant "pulse server" in both those cases.

jason - 07/Apr/08 01:09 PM
Hi Tom,

OK, using the same user simplifies things. From your description, I think the problem may have been an incorrectly-specified data directory. The default data directory suggested by Pulse is $HOME/.pulse/data, not $HOME/.pulse. Pulse does put a config file in $HOME/.pulse to hold configuration values that are required before starting the web UI, but this file is not part of the $PULSE_DATA. So your steps all sound right except that the data/ part of the directory was omitted when starting the new version with -d, leading Pulse to think that there was no existing data.

Re: agent pings, any failed ping will cause the agent to be marked offline, so making them more frequent will not help. In fact, if there is a long delay waiting for a response the agent should not be pinged again until the response arrives or a timeout occurs. So I would recommend a setting where your ping timeout is longer than the average. The ping interval may also be set longer to reduce the chance of a bad ping, although this will not in itself solve any underlying problem.
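
To make that reasoning concrete, here is a toy model of the rule as described, not Pulse's actual ping implementation. It assumes each ping gets its reply after some latency and that the agent is marked offline as soon as any one reply takes longer than the timeout. The class name PingModelSketch and the latency figures are invented for illustration.

```java
// Toy model for illustration; names and numbers are hypothetical.
class PingModelSketch {

    // The agent stays online only if every ping is answered within the timeout.
    static boolean staysOnline(long[] replyMillis, long timeoutMillis) {
        for (long reply : replyMillis) {
            if (reply > timeoutMillis) {
                return false; // a single slow reply is enough to mark the agent offline
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Four pings, one of which is slow (12 seconds).
        long[] replies = {200, 350, 12000, 250};

        System.out.println(staysOnline(replies, 10000)); // prints "false"
        System.out.println(staysOnline(replies, 15000)); // prints "true"
    }
}
```

Under this model, pinging more often only adds more entries to the array, each of which is another chance to exceed the timeout, whereas raising the timeout lets the slow reply through.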

The fact that the timeout occurred during the checkout is interesting. I doubt there was much load on the agent due to checkout, which suggests that CPU overload is not the issue. Is the Subversion server across the same network? Perhaps the collision in traffic (checkout + ping simultaneously) makes a timeout more likely. The pings are lightweight so should not create a lot of network load, but it is still something to keep an eye on (especially if this happens during checkouts at other times).

Tom Jacob - 07/Apr/08 01:46 PM
Hi Jason,

Thanks for the explanation on the properties. I'll be changing the timeout setting as suggested. Hope that at least reduces the frequency of this failure to an acceptable level. As to the subversion server, yes, it is on the same subnet.