I have this query
SELECT zip,
( 3959 * acos( cos( radians(34.12520) ) * cos( radians( zip_info.latitude ) ) * cos(radians( zip_info.longitude ) - radians(-118.29200) ) + sin( radians(34.12520) ) * sin( radians( zip_info.latitude ) ) ) ) AS distance,
user_info.*, office_locations.*
FROM zip_info
RIGHT JOIN office_locations ON office_locations.zipcode = zip_info.zip
RIGHT JOIN user_info ON office_locations.doctor_id = user_info.id
WHERE user_info.status='yes'
HAVING distance < 50 ORDER BY distance ASC
It outputs:
distance | doctor_id | etc.
7 | 5 | etc.
8 | 4 | etc.
34 | 4 | etc.
49 | 5 | etc.
When I select a distance of 30 or less, it shows the top two results as well, which is good.
The problem: I do not want to show more than one result per doctor_id, so I add GROUP BY user_info.doctor_id, which returns no results when the distance filter is less than 50. For some reason it seems to need all the results in order to group them; otherwise it won’t work. Any tips? Is there anything else you need to help me out?
So what I want is:
distance | doctor_id | etc.
7 | 5 | etc.
8 | 4 | etc.
Even though the query wants to give me all 4 rows, I just want to group them so that only the row with the smallest distance for each unique user_info.doctor_id shows up. Keep in mind that distance is a computed alias, not a column that exists in any table.
Based on llion’s query, here are the results:
(concat(user_info.id)) | zip | distance | id
1 | NULL | 6.6643992 | 1
It only gives one result, and to get it to work I had to change the AND back to HAVING distance.
I don’t believe a GROUP BY is going to give you the result you want. And unfortunately, MySQL versions before 8.0 do not support analytic (window) functions, which is how we would solve this problem in Oracle or SQL Server.
It’s possible to emulate some rudimentary analytic functions, by making use of user-defined variables.
In this case, we want to emulate:
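For reference, the analytic expression being emulated would look roughly like this in a database that supports window functions (Oracle, SQL Server, or MySQL 8.0 and later). This is a sketch, not necessarily the exact expression the answer had in mind; the alias seq is the one used for the row number in the discussion that follows.

```sql
-- Number the rows for each doctor by distance; the closest office gets seq = 1.
ROW_NUMBER() OVER (PARTITION BY doctor_id ORDER BY distance) AS seq
```

Filtering on seq = 1 then leaves exactly one row, the closest one, per doctor_id.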
So, starting with the original query, I changed the ORDER BY so that it sorts on doctor_id first, and then on the calculated distance. (Until we know those distances, we don’t know which one is “closest”.)
With this sorted result, we basically “number” the rows for each doctor_id, the closest one as 1, the second closest as 2, and so on. When we get a new doctor_id, we start again with the closest as 1.
To accomplish this, we make use of user-defined variables. We use one for assigning the row number (the variable name is @i, and returned column has the alias seq). The other variable we use to “remember” the doctor_id from the previous row, so we can detect a “break” in the doctor_id, so we can know when to restart the row numbering at 1 again.
Here’s the query:
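A sketch of a query that fits this description, using the tables and columns from the question. The variable name @prev_doctor_id and the column alias prev_doctor_id are assumptions; only @i, seq, and the derived-table alias z are named in the text. It also relies on MySQL evaluating the SELECT-list expressions left to right and honoring the ORDER BY of the derived table, which MySQL does not formally guarantee, so treat it as illustrative.

```sql
SELECT z.*
       -- restart the numbering at 1 whenever doctor_id changes
     , @i := IF(z.doctor_id = @prev_doctor_id, @i + 1, 1) AS seq
       -- remember this row's doctor_id for the comparison on the next row
     , @prev_doctor_id := z.doctor_id AS prev_doctor_id
  FROM (
         -- the original query, minus HAVING distance < 50,
         -- sorted by doctor_id and then by distance.
         -- NOTE: if user_info and office_locations share a column name,
         -- replace the .* with an explicit, aliased column list.
         SELECT zip_info.zip
              , ( 3959 * acos( cos( radians(34.12520) )
                             * cos( radians( zip_info.latitude ) )
                             * cos( radians( zip_info.longitude ) - radians(-118.29200) )
                             + sin( radians(34.12520) )
                             * sin( radians( zip_info.latitude ) ) ) ) AS distance
              , user_info.*
              , office_locations.*
           FROM zip_info
          RIGHT JOIN office_locations ON office_locations.zipcode = zip_info.zip
          RIGHT JOIN user_info ON office_locations.doctor_id = user_info.id
          WHERE user_info.status = 'yes'
          ORDER BY office_locations.doctor_id, distance
       ) z
  JOIN (SELECT @i := 0, @prev_doctor_id := NULL) init  -- initialize the variables
HAVING seq = 1
```

The HAVING seq = 1 predicate keeps only the closest row for each doctor_id. No outer ORDER BY is added here, since re-sorting in the same query can change the order in which the variables are evaluated.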
I’m making an assumption that the original query is returning the result set you need, it just has too many rows, and you want to eliminate all but the “closest” (the row with the minimum value of distance) for each doctor_id.
I’ve wrapped your original query in another query; the only changes I made to the original query were to order the results by doctor_id and then by distance, and to remove the HAVING distance < 50 clause. (If you only want to return distances less than 50, then go ahead and leave that clause there. It wasn’t clear whether that was your intent, or whether that was specified in an attempt to limit rows to one per doctor_id.)
A couple of issues to note:
The replacement query returns two additional columns; these aren’t really needed in the result set, except as means to generate the result set. (It’s possible to wrap this whole SELECT again in another SELECT to omit those columns, but that is really more messy than it’s worth. I would just retrieve the columns, and know that I can ignore them.)
The other issue is that the use of the .* in the inner query is a bit dangerous, in that we really need to guarantee that the column names returned by that query are unique. Even if the column names are distinct right now, the addition of a column to one of those tables could introduce an “ambiguous” column exception in the query. It’s best to avoid that, and that’s easily addressed by replacing the .* with the list of columns to be returned, and specifying an alias for any “duplicate” column name. (The use of the z.* in the outer query is not a concern, as long as we are in control of the columns returned by z.)
Addendum:
I noted that a GROUP BY wasn’t going to give you the result set you needed. While it would be possible to get the result set with a query using GROUP BY, a statement that returns the CORRECT result set would be tedious. You could specify MIN(distance) ... GROUP BY doctor_id, and that would get you the smallest distance, BUT there is no guarantee that the other non-aggregate expressions in the SELECT list would be from the row with the minimum distance, and not some other row. (MySQL is dangerously liberal in regards to GROUP BY and aggregates. To get the MySQL engine to be more cautious, and in line with other relational database engines, use SET sql_mode = 'ONLY_FULL_GROUP_BY'.)
Addendum 2:
Performance issues reported by Darious: “some queries take 7 seconds.”
To speed things up, you probably want to cache the results of the function. Basically, build a lookup table. e.g.
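A minimal sketch of what such a lookup table and its maintenance query could look like. The table name office_location_distance, the index name, the column types, and the assumption that office_locations has an id primary key are all hypothetical; the column names zipcode, gc_distance, and office_location_id are the ones used in this answer.

```sql
-- Cache of great-circle distances from each zipcode to each office location.
CREATE TABLE office_location_distance
( zipcode            VARCHAR(10)   NOT NULL  -- assumed type; match zip_info.zip
, office_location_id INT           NOT NULL  -- assumed to reference office_locations.id
, gc_distance        DECIMAL(18,2) NOT NULL  -- miles, stored as DECIMAL rather than FLOAT
, PRIMARY KEY (zipcode, office_location_id)
, KEY office_location_distance_ix1 (zipcode, gc_distance, office_location_id)
);

-- Populate or refresh the cache: s is the search zipcode, d is the office's zipcode.
INSERT INTO office_location_distance (zipcode, office_location_id, gc_distance)
SELECT s.zip
     , o.id
       -- LEAST(1, ...) guards against rounding pushing the ACOS argument above 1
     , ROUND( 3959 * ACOS( LEAST( 1,
                COS( RADIANS(s.latitude) ) * COS( RADIANS(d.latitude) )
              * COS( RADIANS(d.longitude) - RADIANS(s.longitude) )
              + SIN( RADIANS(s.latitude) ) * SIN( RADIANS(d.latitude) ) ) ), 2 )
  FROM zip_info s
 CROSS JOIN office_locations o
  JOIN zip_info d ON d.zip = o.zipcode
    ON DUPLICATE KEY UPDATE gc_distance = VALUES(gc_distance);
```

A search would then read from the cache instead of recalculating the distance for every row, for example with WHERE zipcode = (the search zipcode) AND gc_distance < 50, joined back to office_locations and user_info.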
That’s just an idea. I expect that you are searching for office_location distance from a particular zipcode, so the index on (zipcode, gc_distance, office_location_id) is the covering index your query would need. (I would avoid storing the calculated distance as a FLOAT, due to poor query performance with the FLOAT datatype.)
With the function results cached and indexed, your queries should be much faster.
I am hesitant about adding a HAVING predicate on the INSERT/UPDATE to the cache table. Suppose you had a wrong latitude/longitude and had calculated an erroneous distance under 100 miles; on a subsequent run, after the lat/long is fixed, the distance works out to 1000 miles. If that row is now excluded from the query, then the existing row in the cache table won’t get updated. (You could clear the cache table, but that’s not really necessary; that’s just a lot of extra work for the database and logs. If the result set of the maintenance query is too large, it could be broken down to run iteratively for each zipcode, or each office_location.)
On the other hand, if you aren’t interested in any distances over a certain value, you could add the HAVING gc_distance < predicate, and cut down the size of the cache table considerably.