I have two vectors, x and y.
x is a vector where each entry represents a month over a period of several years, so if I have (let’s say) 10 years of data, then length(x) = 120, and so on.
(I built x with as.POSIXct, so the entries really are “months” in that sense. But couldn’t I just use a numeric vector like c(1:n) instead, since I already know which month and year each element corresponds to? I.e. if x = c(1:n) and the series starts in January, I know that x[13] is January of the second year, and so on.)
y is a vector where each element is an observation of a particular variable in a certain month.
So the observed data is grouped like this (january,0.123), (february,2.125) and so on.
I have two vectors for the months:
x1 = seq(as.POSIXct("YYYY-MM-DD", tz="GMT"),
         as.POSIXct("YYYY-MM-DD", tz="GMT"),
         by="month")
x2 = c(1:length(x1))
What I want to do is to run ksmooth:
plot(x1,y)
smooth = ksmooth(x2,y,"normal")
lines(smooth)
The reason that I use x1 in the plot() command is that I don’t know how to otherwise get the x-axis in time.
I assumed R would automatically find a decent smoothing parameter when I don’t specify one. The result is that smooth$y is equal to the input vector y! Also, a vertical bar is produced in the plot. If I replace x2 by x1 in the code above, smooth$y is NA for all values except the first and last, which equal those of the input y.
So I tried some bandwidths:
h = 0.1: now smooth$y = y, as before. A vertical bar is produced (it is the same color as I specified in the lines() command, so it must have to do with the ksmooth command.)
h = 10: I get some non-strange results for smooth$y; however, a vertical bar is produced as before.
Then I tried the crazy idea of very large bandwidths:
h = 1e+06: This produced nothing when I used x1 and x2 as in the code above. When I changed x2 to x1 however, I get some good results. For h = 1e+09 (that’s huge!!) I get a very nice result. (I get a curve that fits the data and looks nice)
But is h = 1e+09 reasonable? In all the examples I have looked at, h is somewhere between 0.1 and 10, give or take. I have also heard of a rule of thumb: h should equal n^(-1/5), where n is the number of data points.
I think the one thing you are missing is that R doesn’t find a decent smoothing parameter when you haven’t specified one; it just uses the default bandwidth of 0.5, which is useless in your case.
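To see the effect of the default, here is a minimal sketch with simulated data. The sine-plus-noise y and the bandwidth of 6 are made up for illustration and are not taken from your data.

```r
set.seed(1)

n  <- 120                      # 10 years of monthly observations
x2 <- seq_len(n)               # month index 1, 2, ..., n
# Simulated stand-in for the observations: a yearly cycle plus noise
y  <- sin(2 * pi * x2 / 12) + rnorm(n, sd = 0.3)

# No bandwidth given, so ksmooth() falls back to bandwidth = 0.5
s_default <- ksmooth(x2, y, kernel = "normal")

# The "smoothed" values are just the inputs again
same_as_input <- isTRUE(all.equal(s_default$y, y))
print(same_as_input)

# A bandwidth of several months (an arbitrary choice here) does smooth
s_wide <- ksmooth(x2, y, kernel = "normal", bandwidth = 6)
print(isTRUE(all.equal(s_wide$y, y)))
```

With x spaced one unit apart, a bandwidth of 0.5 is too narrow to give the neighbouring months any meaningful weight, so each fitted value is essentially the observation itself.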
The other thing you might be missing is that in ksmooth the bandwidth parameter is in terms of x. When ksmooth takes an x value that is a date-time (POSIXct, as in your x1), it converts it to a numeric, which is the number of seconds. Therefore, your bandwidth will be measured in seconds, an undesirable result. When ksmooth takes an x value of months, it will default to a bandwidth of 0.5 months, also undesirable.
What you want to do is specify a reasonable bandwidth for the x that you are using. Here is an example:
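As a rough sketch of the unit issue, the code below smooths the same simulated series twice: once against a month index and once against time in seconds. The dates, the simulated y, the 6-month bandwidth and the helper name secs_per_month are all assumptions for illustration.

```r
set.seed(1)

# Monthly time stamps (example dates, not from the question)
x1 <- seq(as.POSIXct("2000-01-01", tz = "GMT"),
          as.POSIXct("2009-12-01", tz = "GMT"),
          by = "month")
n  <- length(x1)
x2 <- seq_len(n)
y  <- sin(2 * pi * x2 / 12) + rnorm(n, sd = 0.3)   # simulated data

bw_months <- 6                                      # arbitrary choice

# x in months: the bandwidth is given in months
s_index <- ksmooth(x2, y, kernel = "normal",
                   bandwidth = bw_months, x.points = x2)

# x in seconds: the same bandwidth has to be converted to seconds
secs_per_month <- 365.25 / 12 * 24 * 60 * 60        # average month length
t_sec  <- as.numeric(x1)
s_time <- ksmooth(t_sec, y, kernel = "normal",
                  bandwidth = bw_months * secs_per_month,
                  x.points = t_sec)

# Plot the smoothed values against x1 so that the points and the
# curve share the same time axis
plot(x1, y)
lines(x1, s_index$y, col = "red")
```

The two fits should be close to each other, differing only because calendar months are not all the same length. For scale, 1e+06 seconds is about 11.6 days and 1e+09 seconds is about 31.7 years.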