Ublox M8N half cycle valid bit

In my last post I introduced a new, more challenging data set with a higher number of cycle slips than in previous sets. Even after improving the RTKLIB code to better handle missing data samples, there are still a large number of points in the solution with only a float status (yellow in the plot). In addition, the last 15 minutes appear to have at least a meter of error in the vertical axis; since the data was taken in a series of loops, the vertical measurements should match those from the earlier loops.

 

half_cycle1

By plotting the observations of the rover, we can see that there are many cycle slips on a large number of the satellites. This is the primary cause of the poor solution. Here are the rover observations with a 15 degree elevation mask.

half_cycle2

In order to improve this solution, we are going to have to take a closer look at the cycle slips. The first thing I noticed when examining a few of these cycle slips is that there are a lot of half cycle errors in the carrier-phase measurements following a reported cycle slip. This is true whether or not an actual cycle slip occurred. I had seen this earlier as well when doing an exercise of looking at the double differences. Here’s a plot from that previous post showing what I am talking about. The second red circle contains a number of samples all in error by almost exactly one half cycle.

half_cycle3

Let’s go back to the raw ublox data before it is converted into RINEX format to see if that can shed any light on what is going on. To see the raw unconverted ublox data I enabled the trace option when running convbin by adding “-trace” to the command line. I also had to change an “#if 0” in the decode_trkmeas() function in ublox.c to “#if 1” to output the full debug information. Below, the top plot shows the rover observations zoomed in to a time period around 6:45:27.0, and underneath it is the raw ublox data for that sample.

half_cycle4a

half_cycle4b

The flag byte is what we are interested in. Since the M8N does not officially support raw output, there is no documentation for this data byte. From the existing ublox.c code though we can see how RTKLIB is using it. It is interpreting bit 5 (0x20) as phase lock and bit 6 (0x40) as the half cycle bit. When the half cycle bit is set, one half cycle is added to the carrier-phase measurement. The other 6 bits are being ignored.
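
To make that concrete, here is a rough sketch of that interpretation (the bit masks come from the existing code, but the macro names and the variable holding the carrier phase are just illustrative):

     /* how decode_trkmeas() appears to use the undocumented flag byte */
     #define FLAG_LOCK   0x20    /* bit 5: carrier-phase lock            */
     #define FLAG_HALFC  0x40    /* bit 6: half-cycle shift applied      */

     if (flag&FLAG_HALFC) adr+=0.5;   /* add one half cycle to the carrier phase */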

Looking at the documentation for the M8T which does officially support raw data provides a hint as to what might be going on. In the description of the RAWX command one of the bits in the trkStat byte is used to indicate whether the half cycle bit is valid or not. Since the M8N and the M8T share the same core, it would be reasonable to assume there was an equivalent bit in the M8N.

We know from the data that the half cycle errors occur shortly after a reported cycle slip, so let’s look at those satellites. From the observation plot we can see that in this case satellites G30, R03, R22, and R23 all have reported cycle slips in the last few seconds before the sample we have raw data for (6:45:27.0). If we look at the flag byte for those satellites, we can see that bit 7 (0x80) is clear for all four of them. It is set for all the other satellites except G21, which has no valid data, and I120, which is the SBAS satellite. Each line in the raw data is one satellite, “sys” is the system (0=GPS, 1=SBAS, 6=GLO) and “prn” is the satellite number. We can also see that when bit 7 is clear, the half cycle bit (0x40) is never set, which would also be consistent with bit 7 being the “half cycle valid” flag.

I added the following line of code to the decode_trkmeas() function in ublox.c to update the observation with the half cycle valid status in the same way it is done for the M8T. RTKLIB refers to this bit as the “Parity Unknown” bit, but a comment in the RTKPLOT section of the manual explains that it means the half-cycle ambiguities in the carrier-phase are not resolved. Setting LLI bit 1 (the value 2) is how RINEX flags a possible half-cycle ambiguity, which is why the code below ORs in a 2 when the valid bit is clear.

     raw->obs.data[n].LLI[0]|=flag&0x80?0:2; /* half cycle valid */

I then reran the conversion using the convbin raw data conversion CUI. Plotting the updated observation data using RTKPLOT with the “Parity Unknown” option set to “ON” gave the following result:

half_cycle5

The gray ticks indicate unknown parity (i.e. half cycle invalid). Unfortunately the “unknown parity” ticks seem to take precedence over the “cycle slip” ticks, so any sample that has both shows up as “unknown parity”. This is why there seem to be fewer cycle slips in this plot than in the original plot above.

So, let’s rerun the solution on the improved observation data using the same configuration settings as previously. The result is plotted below.

half_cycle6

This looks a lot better. Not only do we see a lot more fixes, but the obvious vertical errors between 7:15 and 7:30 have gone away. The two yellow sections in the ground track plot on the left align with the two locations where the car went directly underneath overhanging branches, so it is not surprising that it is more difficult to maintain fixes at those spots.

There may be more opportunity for improvement here if we can figure out how to take better advantage of the half cycle status. The current RTKLIB code simply resets the phase bias estimate for that satellite every time this bit changes state.
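
The reset happens because the slip-detection code treats any change in this LLI bit as a cycle slip, roughly like this (a sketch, not the verbatim source; LLI_prev stands for the value stored from the previous epoch):

     /* if the half-cycle-valid flag (LLI bit 1) changed since the last epoch,
        mark a slip so the phase bias state for this satellite gets reset */
     if ((LLI_prev&2)!=(obs[i].LLI[0]&2)) {
         slip|=1;
     }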

I have uploaded this change to the demo4_b12 branch in my GitHub repository. I have also posted the modified rtkconv executable here.

The reason I posted rtkconv (the GUI version) instead of convbin (the CUI version) is that I am now able to build the GUIs with my new Embarcadero compiler, and that I have found convbin does not seem to work for the newer (3.xx) RINEX formats. I prefer the newer formats because they are easier to parse.

RTKLIB GUI Builder: Special Deal $49

For anyone who is interested in building the GUI versions of RTKLIB,  Embarcadero has a special deal going on right now for the new starter edition of their C++ Builder for just $49. It’s available at least for the next seven days.  Apparently it was available for one day for free but I unfortunately missed that opportunity.  The regular price is $217 so this is a significant discount.

I have bought a copy and tried it out on a few of the 2.4.3 b8 modules. All of them built either error-free or with just a couple of minor fixes. Apparently it does not work yet on the very latest version of RTKLIB, but I’m sure somebody will fix that soon.

For those of you who don’t know, the CUI versions of RTKLIB can be built with either Visual Studio C++ or GCC, but to build the GUI apps, the Embarcadero compiler is required.

 

 

RTKLIB: Optimized Receiver Dynamics

 

Last post I mentioned adding a “pseudo” receiver dynamics mode to RTKLIB that gives most of the benefit of the receiver dynamics option with a much smaller number of computations. Since then I have realized that a better alternative is to optimize the existing dynamics mode to use fewer computations rather than replace it with something new. That is what I will discuss in this post.

First let’s take a quick look at how the receiver dynamics feature works.

There are normally two steps for each time sample in the kalman filter calculations. In the first step, the kalman filter predicts the receiver position for the current time sample based on the receiver position of the previous sample, the time elapsed since the last sample, and the known dynamics of the receiver. It also expresses the confidence of that prediction in terms of variances. In the second step, these predictions are updated based on the input measurements from the receiver. Confidence levels (variances) are also estimated for the measurements, and the relative weighting between prediction and measurement used in this update is determined by the variances.

It is the first step that we are interested in here. It is referred to as the time update of the kalman filter in the documentation. It is also sometimes referred to as the state prediction.

If receiver dynamics is set to off, then this first step is skipped. Instead of predicting the position, RTKLIB simply uses the coarse measurement derived from the pseudorange measurements. This measurement will generally be much less accurate than the predicted position, which is why enabling receiver dynamics is desirable.

If receiver dynamics is set to on, then RTKLIB will add six states to the kalman filter, three for receiver velocity (x, y, z) and three for acceleration. This is in addition to the existing three states for receiver position (x, y, z). The kalman filter will then use these estimates of the receiver’s velocity and acceleration when predicting what the receiver position will be one sample later. These states are updated in a similar way to the other states, but note that unlike most of the states, they do not have direct measurements associated with them. Since, in typical RTKLIB applications, the time between samples is quite short and the receiver’s accelerations are bounded, these predictions should be quite accurate. The confidence of these position estimates will also be updated based on the length of time to the next sample and the acceleration characteristics of the receiver, defined by the configuration file input parameters “stats-prnaccelh” and “stats-prnaccelv”. The longer the time between samples and the larger the possible accelerations of the receiver, the less confident the kalman filter will be in its estimates. Less confidence means higher variances.
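
For reference, the relevant lines in an RTKLIB configuration file look something like this (the values shown are only illustrative):

     pos1-dynamics      =on       # add velocity and acceleration states to the filter
     stats-prnaccelh    =1        # process noise for horizontal acceleration (m/s^2)
     stats-prnaccelv    =0.1      # process noise for vertical acceleration (m/s^2)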

For more details on exactly how this is done let’s look at Appendix E in the user manual. In section E.7 we find the matrix used for the time update of the kalman filter.

predict

The updated state vector x(k+1), one sample time later than x(k), is calculated by multiplying this matrix by x(k). In other words:

     x(k+1) = F * x(k)

In this case F is a square matrix, n×n, where n is the number of states. The exact number of states will depend on input parameters, but for my example, with GPS, GLONASS, SBAS, and dynamics enabled, n is equal to 98. This is one state for each satellite plus the nine for the receiver mentioned above. Multiplying an n×m matrix by an m×p matrix, assuming no shortcuts are taken, requires n×m×p multiplications, in this case 98×98×1 = 9604.

Let’s take a closer look at F, the state-transition matrix. First, notice that it is almost equal to the identity matrix, which is a matrix with all 1’s on the diagonal (I is a notation for identity matrix here). Only the one 3×3 term multiplied by tau is off the diagonal. If it wasn’t for this term, x(k+1) would simply equal x(k) since multiplying a matrix by I is equivalent to multiplying a scalar by one. This makes sense since we are only updating the position states … all the satellite bias states should remain constant with time.

So what is the tau term doing here? Let’s look at the first row in the F matrix. By replacing the I notation with actual numbers we can see that it is equal to {1 0 0 τ 0 0 …} where the remaining 92 elements are all zero. In this case τ is the time difference between samples. Since the first three states are position in the x,y,z axes and the next three states are velocity in the x,y,z axes, the first three rows of F corresponds to the equations

posx(k+1) = posx(k) + velx(k) * τ
posy(k+1) = posy(k) + vely(k) * τ
posz(k+1) = posz(k) + velz(k) * τ

So what could appear to be a somewhat intimidating equation in matrix form turns out to be quite simple and familiar when reduced to its parts.

Since the rest of F is the identity matrix, the equations for all the other states will be even simpler and will just be equal to their previous value.

In actuality, the version of the F matrix I’ve quoted here from the documentation is not quite right and probably reflects an earlier version since it only adds velocity states and not acceleration states. The correct F matrix to match the code is:

predict6

The only difference is that we’ve added the acceleration states in addition to the velocity states, and the velocities are updated with the accelerations just as the positions are updated with the velocities.
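
In code, the corrected F can be built from the identity matrix by placing τ (the sample interval, called tt in the code) on the off-diagonal terms for the first six rows. Something like this, a sketch using RTKLIB’s eye() helper; the actual code may differ in detail:

     /* sketch: build the corrected state transition matrix F (column-major, as
        RTKLIB stores matrices); only rows 0-5 get an off-diagonal tau term */
     double *F=eye(nx);                 /* start from the identity matrix */
     for (i=0;i<6;i++) {
         F[i+(i+3)*nx]=tt;              /* pos rows get tau*vel, vel rows get tau*acc */
     }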

Since only the first nine states are used in the off-diagonal terms of the equation, there really is no need to multiply the full matrices to update the state vector. We can calculate the first nine states, which requires 9×9×1 multiplications, and replace the rest of the operation with a matrix copy, which is equivalent to multiplying by the identity matrix but requires only one step per element instead of 98 floating point multiplies. This is what I did in the code.
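
A sketch of what that looks like (illustrative, not necessarily line-for-line what is in the demo code):

     /* sketch: reduced time update of the state vector x; F is nx by nx in
        column-major order, states are pos[3], vel[3], acc[3], then the biases */
     double xp[9]={0};
     for (i=0;i<9;i++) {
         for (j=0;j<9;j++) xp[i]+=F[i+j*nx]*x[j];   /* 9x9x1 multiplies */
     }
     for (i=0;i<9;i++) x[i]=xp[i];   /* states 9..nx-1 are simply kept (identity part) */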

Percentage-wise, the computational savings are significant since we have reduced the floating point multiplies from 98×98 to 9×9. However, the number of multiplies in this step is not large to start with, so the absolute savings are small.

The real opportunity is in the calculation of the covariance matrix P which is a similar operation but much more computationally intensive because the covariance matrix is 98×98 elements, unlike the state vector which is only 98×1 elements. The larger size is because it contains the covariances (or correlations) between each state and every other state. We use a similar operation and the same state-transition matrix F, but the equation becomes

     P(k+1) = F * P(k) * F’

where P is the covariance matrix, and F’ is the transpose of the F matrix. Again, because F is very close to the identity matrix, we can replace most of the multiply by a copy, then calculate directly the few elements that change with the state update. It is a little more complicated because to realize a significant savings we can’t get away with just multiplying a sub-matrix like we did with the state vector. Instead we directly compute the elements that need to be modified. To calculate F*P, the original code looked like this:

    matmul("NN",rtk->nx,rtk->nx,rtk->nx,1.0,F,rtk->P,0.0,FP);

Since F and P are both 98×98 element matrices, this call required 98×98×98, or 941,192, floating point multiplies per sample time. I replaced the above matrix multiplication with the following lines of code:

     /* P=F*P, only calc non-zero off diagonal terms to save time */
     matcpy(FP,rtk->P,rtk->nx,rtk->nx);
     for (j=0;j<rtk->nx;j++)
         for (i=0;i<6;i++)
             FP[i+j*rtk->nx]+=rtk->P[i+3+j*rtk->nx]*tt;

This gives an answer identical to the full multiply but requires only 98×6, or 588, floating point multiplies. A similar substitution for the FP * F’ operation saved nearly another million multiplies every sample.
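
For reference, the matching shortcut for the FP * F’ step looks something like this (again a sketch; the code in the demo branch may differ slightly):

     /* P = FP*F' : only the first six columns of the result pick up an
        off-diagonal term, everything else is a straight copy of FP */
     matcpy(rtk->P,FP,rtk->nx,rtk->nx);
     for (i=0;i<rtk->nx;i++)
         for (j=0;j<6;j++)
             rtk->P[i+j*rtk->nx]+=FP[i+(j+3)*rtk->nx]*tt;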

While it will vary from system to system, I was able to roughly estimate how much time is saved in my particular case by adding a call to the Windows timeGetTime() function before and after the call to the state update function. Before the modification, the state update was taking roughly 15 msec, which was about 50% of the total computation time for each epoch. With the change, the time was too small to measure. I also verified that the solution results after this change are identical to the previous results, which is expected since we haven’t changed anything except the way we do the calculation.
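
The instrumentation itself was nothing more sophisticated than something along these lines (a sketch; timeGetTime() has roughly millisecond resolution, which is adequate for a 15 msec measurement):

     /* rough timing of the time update (Windows only; include <windows.h> and
        link with winmm.lib for timeGetTime()) */
     DWORD t0=timeGetTime();
     /* ... call to the kalman filter time update ... */
     trace(2,"time update: %u ms\n",(unsigned int)(timeGetTime()-t0));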

I had previously uploaded my “pseudo” dynamics feature to the demo3 code in my GitHub repository, but with this change I felt there was no value in that feature, so have removed it and replaced it with this one.

RTKLIB: Receiver Dynamics and Outlier Rejection

Several readers now have mentioned that they have had to set the receiver dynamics option in the input configuration file to “off” when running solutions in real-time because of limited CPU bandwidth and that this leads to poorer results. I don’t have this problem in my experiments because I am post-processing the data and so the CPU does not need to keep up with the input data.  Hence I have always had this option set to “on”. But I hope to switch to real-time processing with an SBC at some point and decided to take a look at this issue.

First of all I tried disabling receiver dynamics and re-running the solution for the data set I introduced in the previous post, using my demo3 version of RTKLIB again. The plot on the left is position with receiver dynamics enabled, on the right is with dynamics disabled, otherwise the input options are identical.

dynamics1

Clearly there is some serious degradation with dynamics disabled! The difference is not a complete surprise, because when we disable dynamics we are throwing away some valuable information. The amount of degradation should perhaps have been a clue that something else was wrong, but at the time I didn’t investigate closely enough why things got worse. Instead I went ahead and implemented a “pseudo-dynamics” mode that uses a small fraction of the calculations of the full dynamics mode but gives most of the benefit. I think this is a useful improvement, it did in fact make the problem go away, and I will discuss that solution in the next post. It turns out, though, that even though it made the problem go away, it did not address the root cause; it just covered it back up again.

It wasn’t until I was testing this new feature that I started to see some strange things and  realized that the lack of dynamics was not enough to explain what was going on.

So let’s take a closer look at the results with dynamics disabled. Unfortunately there are no outputs visible to RTKPLOT or in the output files where the problem can be seen, so it requires digging into the trace files. Below are some snippets from the trace files showing the residuals of the initial double differences from a sample just as the solution first started to degrade. The residuals available in RTKPLOT and the output files are the residuals after the last iteration of the kalman filter and not the initial residuals.  These will be significantly smaller and so do not show the problem.

The trace on the left is with dynamics enabled, and the right is with them disabled. I will discuss more about how dynamics works in the next post, for now, if you are not familiar with the feature, just be aware that it improves RTKLIB’s initial guess of the receiver’s position each sample by using information from the previous positions.

dynamics2

The double difference residual for each satellite pair is listed after the “v=”. The L1 and P1 rows are for the phase measurements and the pseudorange measurements respectively. Because the initial position estimates are more accurate with dynamics on, you can see that the residuals on the left are significantly smaller than the ones on the right. Also, in this particular sample the receiver reported a cycle slip on satellite 33, and you can see the residuals are largest in both cases for this satellite. The most important difference between the two is that the larger residual with dynamics off was large enough to trigger the outlier rejection threshold, resulting in that residual not being used as an input to the kalman filter. Introducing a non-linearity like this into a feedback loop always risks affecting its stability, which appears to be what happened here. Without any feedback, the errors continued to grow and to be rejected, eventually causing other residuals to be rejected, until the whole solution fell apart.
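
For reference, the rejection itself is a simple threshold test on the prior (innovation) residuals, roughly along these lines (a sketch; the actual check lives in rtkpos.c and compares against the pos2-rejionno value):

     /* sketch: if a double-difference residual exceeds the threshold, the
        measurement is flagged and excluded from the kalman filter update */
     if (opt->maxinno>0.0&&fabs(v[i])>opt->maxinno) {
         trace(2,"outlier rejected: sat=%2d v=%.3f\n",sat,v[i]);
         continue;   /* this residual is not fed back into the filter */
     }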

The threshold used by RTKLIB to reject outliers is adjustable and is set by the input parameter “pos2-rejionno” in the input configuration file. The name is an abbreviation for “reject innovations”, although there seems to be an extra “o”. “Innovation” is the term for the error inputs to the kalman filter. The default value for this threshold, and the one I have been using in my experiments, is 30 meters. This is consistent with the two residuals rejected in the above example, both greater than 30.
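
In the configuration file this is just a single line, so it is easy to experiment with (the comment is mine):

     pos2-rejionno      =30        # innovation (residual) rejection threshold, in meters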

There doesn’t seem to be anything magic about 30 meters, especially when we are striving for centimeter accuracy, so I went ahead and increased it all the way up to 1000 meters to be sure I didn’t trip over it again, then re-ran the solution. Here is the result: position is on the left, and the difference in position between dynamics on and dynamics off is on the right.

dynamics3

Increasing the outlier threshold completely eliminated the problem. What is more surprising is that there is very little difference in the position solution with dynamics on or off. The larger errors in the initial position estimate are still there, as are the larger initial residuals, but the additional iteration of the kalman filter is apparently able to remove nearly all of the initial position error, as can be seen in the right plot above.

So, the bottom line is that I don’t think outlier rejection is working properly in RTKLIB, and I plan to leave this threshold at 1000, effectively disabling the feature, until I see a need to re-enable it.

This problem is not limited to when receiver dynamics are turned off and can happen anytime large residuals occur.  For example, once I knew what to look for, I was able to find the same problem occurring in the initial transient at the beginning of the solution.

To demonstrate this I did another experiment. In a previous post I described moving the solution start point around in the part of the data in which the rover was moving until I was able to get a bad fix. This time I did the same thing, but in the part of the data in which the rover was stationary, and with the outlier threshold set back to 30. Again I was able to find a start time that caused an initial bad fix. I checked the trace file for rejected outliers during the initial transient and sure enough they were there. So once again, I increased “pos2-rejionno” from 30 to 1000 and re-ran. The transient was almost entirely eliminated, and I got a good first fix. Here are the position plots for the two cases, threshold=30 on the left, threshold=1000 on the right.

dynamics4

Notice the difference in y-axis scales and the size of the initial transient. With the threshold set to 1000, as would be expected, there were no outliers rejected in the trace file.

I suspect another thing that aggravated this problem in my case is that I had adjusted the input parameter eratio1 (the ratio of pseudorange measurement errors to carrier phase measurement errors) from 100 to 300. This reduced the time to first fix, but also increased the overshoot of the initial transient, and hence made it more likely to trip the outlier threshold.

So is there a risk that opening up this limit will cause other problems where data that should have been rejected is not? Possibly, but I suspect the benefits of opening up this limit will outweigh any downside. I plan to keep an eye out for true bad data points and deal with them once I have some real examples, but won’t worry about hypothetical cases for now.

So to sum up, I would suggest increasing this limit even if you are not seeing problems at the moment, and being on the lookout for “outlier rejected” messages in your trace files if you are having problems.

More Moving Rover / Fixed Base Data

I decided to try something a little more challenging for the next data set. Here’s a section of a dirt road with many trees on one side of the road and just a few on the other side. It could be representative of a farm field bordered by woods. The trees should cause a fairly large number of cycle slips on some of the satellites and just a few on others. The road is also rough enough to bounce the receiver around a bit. This should add some vertical and lateral accelerations.

road (1024x768)

I used the same two Ublox M8N receivers and antennas as last time, only this time I remembered to bring the pie pan, so I didn’t have to use the old beer can again for a ground plane. Here’s a photo of the improved base station.

piepan (1024x768)

I again set the receivers up to collect 5 Hz GPS and GLONASS data. Last time this worked with no issue, but this time I saw what other people have reported: I got 5 Hz data, but all the GLONASS satellites were missing. After a bit of fiddling with the receiver setup, what I found was that I had to first collect a brief bit of data at 1 Hz, then switch the receivers to 5 Hz, and everything worked fine. To make this easier, I added an extra line to the beginning of the list of startup commands in STRSVR to set the sample rate.

I now have two startup files, one for 1 Hz and the other for 5 Hz. They differ only in the first argument of the CFG-RATE command, which is the measurement period in milliseconds (1000 ms for 1 Hz, 200 ms for 5 Hz).

The 1 Hz startup file now looks like this:

!UBX CFG-RATE 1000 1 1
!UBX CFG-GNSS 0 32 32 1 0 10 32 0 1
!UBX CFG-GNSS 0 32 32 1 6 8 16 0 1
!UBX CFG-MSG 3 15 0 1 0 1 0 0
!UBX CFG-MSG 3 16 0 1 0 1 0 0
!UBX CFG-MSG 1 32 0 1 0 1 0 0

and the 5 Hz startup file looks like this:

!UBX CFG-RATE 200 1 1
!UBX CFG-GNSS 0 32 32 1 0 10 32 0 1
!UBX CFG-GNSS 0 32 32 1 6 8 16 0 1
!UBX CFG-MSG 3 15 0 1 0 1 0 0
!UBX CFG-MSG 3 16 0 1 0 1 0 0
!UBX CFG-MSG 1 32 0 1 0 1 0 0

Once I got past this hurdle, I was able to collect some data. I set up the base station in the middle of the field with unobstructed skies and attached the rover receiver and antenna to the top of the car. For the first fifteen minutes the car was stationary on a part of the road with relatively unobstructed skies, and for the remaining time I drove back and forth along the edge of the trees. Here are a couple of plots of the observation data files, base on the left and rover on the right. The red ticks indicate cycle slips.

argeles0

As expected, the base data is free of cycle slips, but the rover starts to see a fairly large number of them as soon as the trees start to obstruct the rover antenna. For comparison, in the previous data set I collected in open skies, the rover data was nearly completely free of cycle slips.

Processing the data with RTKCONV, then running the solution with my demo3 version of RTKLIB, gave the following solution, ground track on the left, and position on the right.

argeles1

So the results look very good: 100% fixes after the initial acquire and through all the cycle slips. But how do we know the positions are all correct? In the earlier data sets, I mounted both receivers on a single rover, which forced the solution to be a circle of fixed radius. In that case any point that fell off the circle had to be an error and was quite easily spotted. In this case, verifying the data is more difficult because we don’t know what the correct solution is. I don’t think there is any one easy answer to this question, but there are some things we can do to gain confidence in the results. If we run the solution with different input configurations and different sets of satellites and get the same answer, that helps. Sometimes errors are more obvious in the vertical component since it tends to be more constrained. For example, if you drive in circles and the starting point varies by 10 cm in the x or y direction, that could be normal variation, but if you end up 10 cm deeper every circle, something is probably wrong.

The best single test I was able to come up with is to compare two solutions, one with ambiguity resolution mode set to fix-and-hold and the other with it set to continuous. In continuous AR mode, fixes are much more independent of each other since there is no direct input to the kalman filter from the result of the fixes. That is not true of fix-and-hold, which feeds information from the previous fix into the kalman filter to help find subsequent fixes. In general, most problems with erroneous fixes occur when fix-and-hold is enabled. So the goal won’t be to prove the position is correct, only that fixes are validated as reliably with fix-and-hold enabled as they are with continuous AR mode.

RTKPLOT has a nice feature that makes this comparison quite easy. After running both solutions and then opening them as solution 1 and solution 2 in RTKPLOT, you can then plot the difference using the ‘1-2’ button in the top left of the GUI.

For this particular instance, to make the two runs even more independent and to validate my code changes, I ran the continuous AR case using the original RTKLIB code and configuration without any of my modifications (except enabling dynamics), and ran the fix-and-hold case with my demo3 code and configuration file with the extended fix-and-hold enabled for both GPS and GLONASS satellites.

Below on the left is the continuous AR solution. It finds fixes quite easily while the rover is stationary but after the rover starts moving and cycle slips start occurring it has more trouble and gradually drifts off, getting fewer and fewer fixes. The plot on the right below shows the difference between the two solutions. The green indicates where both solutions had fixes, and the yellow where one or both were not fixed.

argeles2

As you can see, every time both solutions get a fix, they differ by less than a couple of centimeters in the x and y directions and a little more in the z direction. These differences are quite acceptable and are too small to be caused by invalid fixes, since one cycle is about 20 cm. We cannot directly confirm the positions for which the continuous AR mode did not get fixes, but based on the continuity of the points in between they are most likely correct as well. So, at least for this data set, I believe fix-and-hold has reliably given us the correct position.

What happens with this test if there are erroneous positions in the fix-and-hold data? On the left below is an example of another fix-and-hold solution of the same data set. Although this one also has 100% fix after the initial acquire, it is definitely incorrect. You can see that after six passes on the road, the car has ended up half a meter higher than it started! Comparing it with the same continuous AR case as above gives the plot on the right, where it is even more obvious that it is not correct, since the green dots are all significantly distant from zero.

argeles3

I was able to create the incorrect solution above simply by delaying the starting point of the solution from the beginning of the data, where the car was stationary and there were no cycle slips, to a point later in time where the car was moving and there were numerous cycle slips. You can see in the above plots that the solution starts at 7:16, where previously it started at 7:00. There is nothing magic about 7:16; I just moved the starting point around until I found a point where it broke the solution.

This brings up a very important point. Fix-and-hold is always going to be most vulnerable to error before it has acquired its first fix. There are things we can do to reduce this vulnerability, but that will always be its weakest point. For that reason, I believe it is essential when using fix-and-hold, at least in its current state, to start it only on clean data where the rover is stationary and the skies are unobstructed. This would normally be done by turning the receivers on and leaving them in a fixed, open-sky position for some amount of time before letting the rover move. How long this needs to be will depend on the quality of the antennas, whether GLONASS satellites are enabled, etc. If the data were being processed in real time, it would be easy to use an LED to indicate when fix-and-hold has locked onto the data. I imagine some people are already doing this.

I’ve added this data and the configuration file I used with it to my library of data sets available here in the argeles_car folder in case anyone else wants to experiment with it.