The Simple Act of Integrating a Weather API - Part 2 September 18, 2024
Sometimes I rush a little bit through it, and I definitely didn't look close enough at the actual data from geonames.org
I had to revisit my solution for finding the coordinates of a city. It also took a lot of manual work.
So, what I did is I wanted to see ahead of time how successful my lookup will be in the future by testing it against all of the tournaments in the past. I have LPGA tournament data going back to 1980, and Epson Tour (or futures) data going back to 1983. I simply wrote a test app that looped through all of this data, and output the data for the tournament if it could not find a location.
This was eye opening!
Over 44 years and two tours, there are 554 distinct locations in the database. These are, from what I can tell, hand entered. And you really get a sense for the many different ways that one can represent the same city over the course of many years.
The first thing I realized, and it was staring me in the face, the geonames database gives the name of the city in ASCII in the third column. Immediately beside the field that I was using and converting non-ASCII characters to ASCII with my Go code in part 1 of this article. So I took that out and used the ASCII name instead. Easy fix.
However, then I found that some cities that I wasn't finding before, namely Galloway NJ (the one example I used), was almost in the geonames database. There's another field in the geonames database for alternate names! I decided to make this a part of my conversion process to JSON and update my code to search it as well. I then re-ran my test app to see how many cities I can't locate. It really only knocked off a few! Hmm, what's going on?
At this point, I took a look at my generated JSON. It was MASSIVE! cities500.txt is 34 MB, my cities.json file ballooned up to 44 MB! And I'm pulling in like 5 fields vs the 15 or so fields offered by geonames. So I looked closer at the data. Even more so than I did before. I noticed some cities have a large number of alternate names. But what was weird, is that the alternate names field consisted of the alternate names in various languages! Even languages where the text goes from right to left. And a lot of gibberish. Like, take this for example, Philadelphia PA in the geonames database:
Philadelphia Philadelphia Coaquannock,Fi-sang,Fi-sàng,Filadehl'fija,Filadel'fi,Filadel'fij,Filadel'fija,Filadelfi,Filadelfia,Filadelfiae
The first one being the name, the second one being the ASCII name, then the third group is the start of all of its alternate names. Philadelphia's record in geonames database is 1,087 columns wide and most of that is the alternate names field!
I played around with this by taking the first 5 alternate names. This raised problems and I was not able to locate more cities than the previous test. Eventually I settled on a variation of converting it to the ASCII equivalents, but instead just grabbed names that were already ASCII. This is a simple Go function.
func isASCII(s string) bool { for i := 0; i < len(s); i++ { if s[i] > unicode.MaxASCII { return false } } return true }
Running my test again, this was a little bit better. Back to where I was before but with a slightly smaller file. However it was still huge. I forget at this point how big it was, but it was probably still 30+ MB.
Then I got an idea. I already know all of the countries that have tournaments. What if I modified my Go program to just bring in cities from these countries? It was also pretty easy, except I have 3 character codes in my database, and geonames only has two character codes. I actually modified my test program to just output 2 letter country codes for each distinct 3 letter country code in my database.
The Go program really went through quite a transformation during this time! You can see the diff between the last two commits. With this change of only including countries that I know I will need, I was able to get my cities.json file down to about 20 MB.
On top of this change, my code to actually search this file is a lot better, as it creates a map of the cities with a key like "city_state_countrycode" and the location, where state is only non-blank if it's a US city. This map is huge, having somewhere in the vicinity of 350,000 keys! As this map has one record for each main city name, and also for each of its alternate names.
Oh but this wasn't the end. Many cities reporting as not being able to find a location. Further inspection of the LPGA cities found that the names were various. Names like "White Plains and Elmsford" were in the data, where I just wanted "White Plains" or "Elmsford". Many of these types of cities existed.
Before, I had a way to add cities manually, to take a file of cities and merge them with the cities500.txt file, to produce one cities.json file. My update here was to add a new method to be able to modify cities in cities500.txt and add alternate names. The White Plains entry is now this:
{"city":"White Plains","alternateNames":["Elmsford and White Plains","White Plains and Elmsford","Elmsford, White Plains and New Rochelle"],"country":"US","state":"NY"}
This part was the most painful and time consuming! I went through each city that I didn't have a match for, went on freakin google maps and searched for the name the LPGA gave me, and just found the closest city that was in cities500.txt! You can just right click on the map and it tells you the coordinates, at least. I had had it by this point, obviously :D Ehh, they'll be close enough. I don't think I picked a nearby city that was on the other side of a mountain range or something, and thus could have completely different weather!
In the end, my file to modify cities and add alternate names to them is around 52 cities. The manually added cities file is 30 cities. So I worked quite a bit to get the list of cities that couldn't be found down to zero! But now it's zero :)
On top of all of these changes, I also ditched the countries lookup that I was using before, and updated my code to process geonames countryInfo.txt file, which I can do the same stuff on like add countries to it. I do need to add countries, like ENG and SCT three letter codes should transform to 'GB'.
I'll definitely be modfiying this code again. I really can just run it at the beginning of every year, finding out which cities I haven't found for that year, and creating a cities.json file with only the cities that I need. This will make a file that's probably in the 1-2 MB range as opposed to 20 MB!
Side note: In my initial application, I decided to include a useful field that's in the geonames database, which is the IANA Time Zone. This is the time zone that's in the format of "America/New York", versus something like "Eastern Standard Time" or "GMT -04:00". Useful to have. BUT. What a freakin pain!! For adding manual cities, this information is simply not available ANYWHERE! So, I just gave up and set them all to "Unknown" :)
Happy coding!