samedi 25 juin 2016

Labelled datatypes Python


I am computing geodesic distances between a point and multiple line segments. Each line segment has a unique identifying number. I want to return distances from my distances function such that they are both intrinsically tied together. I would also like to maintain functionality, as in sort the distances, and index them with either the label or position, and get back both the distance data and the label. Something like a Pandas Series with an index, but I cannot use a series because the data is returned into a Pandas DataFrame, which then expands the series and makes a mess. Here is an example: In [1]: '''Note that all this happens inside an apply function of a Pandas Series''' labels = [25622, 25621, 25620, 25619, 25618] dist = vect_dist_funct(pt, labels) #vect_dist_funct does the computations, and returns distances in meters dist Out[1]: array([296780.2217658355, 296572.4476883276, 296364.21166884096, 296156.4366241771, 295948.6610171968], dtype=object) What I want however, is something like this dict, where the labels and distances are inherently tied to each other: {25622 : 296780.2217658355, 25621 : 296572.4476883276, 25620 : 296364.21166884096, 25619 : 296156.4366241771, 25618 : 295948.6610171968} But now I have lost functionality of the values. I cannot easily sort them, or compare them, or anything. I looked at Numpy Structured Arrays, and they seem workable, but if I am not able to sort the distances, and get the index of the closest segment, it will not be of much use to me. Is there any other datatype that I can use? Long Story and Background I am trying to do a spatial join. I get the indexes of the segments a point is most likely closer to by searching in a RTree (example). Those are the indexes in labels. Then I look through the line geometries table to find the line geometry for those selected labels, and compute the distances of the points to each of the line segment. Next steps involve sanity checking the spatial join. Nearest is not the best join candidate in some cases, and the join needs to be evaluated on other parameters. Therefore, my plan is to work from closest segment outward. Which would involve sorting on the distances, and getting the indexes of the closest segment, then looking through the segment table with that index and extracting other properties of the line for inspection. If a match can be confirmed, the said segment is accepted, else, it is rejected, and the algorithm would move to the next closest segment. A data type that does all this is what I am looking for, without breaking the link between the distances the segment from which it was computed. Problem With Using Pandas So this is how the function is actually being called: joined = points['geometry'].apply(pointer, centroid=line['centroid'], tree_idx=tree_idx)) Then inside pointer, this happens: def pointer(point, centroid, tree_idx): intersect = list(tree_idx.intersection(point.bounds)) if len(intersect) > 0: points = pd.Series([point.coords[0]]*len(intersect)).values polygons = centroid.loc[intersect].values dist = vect_dist_funct(points, polygons) return pd.Series(dist, index=intercept, name='Dist').sort_values() else: return pd.Series(np.nan, index=[0], name='Dist') And then, joined looks like this: This is because distances between all points (the rows are points) and all lines (the columns are lines) are not computed. That would be too cost prohibitive (4M points, and 180k lines per state, and 50 states on the whole dataset). Also, this DataFrame merge operation to produced joined increases the run time 7 times, compared to when I return two Numpy arrays. The problem with returning two Numpy arrays is that it is not easy to keep the distance and the line IDs aligned all the time. Examples of Points, Lines, tree_idx Note that this is truncated dataset in columns and rows. I am only including the columns of relevance, and not the rest of the data: points: geometry id 88400001394219 0.00 POINT (-105.2363291 39.6988139) 0.25 POINT (-105.2372017334178 39.69899060448157) 0.50 POINT (-105.2380177896182 39.69933953105642) 0.75 POINT (-105.2387202141595 39.69988447162143) 1.00 POINT (-105.2393222 39.7005405) 88400002400701 0.00 POINT (-104.7102833 39.8318348) 0.25 POINT (-104.7102827 39.831966625) 0.50 POINT (-104.7102821 39.83209845) 0.75 POINT (-104.7102815 39.832230275) 1.00 POINT (-104.7102809 39.8323621) So this is basically interpolated points on lines. The line id is the first level of index, and the second level is the percent where the point was interpolated. This forms the first dataset, the dataset to which I want to bring some attributes from the second dataset. line: geometry centroid id 71345 POLYGON ((-103.2077992965318 40.58026765162965... (-103.20073265160862, 40.576450381964975) 71346 POLYGON ((-103.2069505830457 40.58155121711739... (-103.19987394433825, 40.57774903464972) 71347 POLYGON ((-103.2061017677045 40.58283487609803... (-103.19901204453959, 40.57905245493993) 71348 POLYGON ((-103.2052000154291 40.58419853220472... (-103.19815200508097, 40.58035300329024) 71349 POLYGON ((-103.2043512639656 40.58548197865339... (-103.19729445792181, 40.58164972491414) 71350 POLYGON ((-103.2035025651746 40.5867652936463,... (-103.1964362470977, 40.5829473948391) 71351 POLYGON ((-103.2026535431035 40.58804903349249... (-103.19557847342394, 40.58424434094705) 71352 POLYGON ((-103.201804801526 40.58933229190573,... (-103.19472966696722, 40.58552767098465) 71353 POLYGON ((-103.2009557884142 40.59061590473365... (-103.19388484652855, 40.58680427447224) 71354 POLYGON ((-103.2001001699726 40.59190793446012... (-103.19303392095904, 40.5880882237994) This is part of the second dataset (the labels mentioned at the beginning of this answer is the index of this dataset). The goal is to transfer attributes from this dataset to the points dataset, in an intelligent manner. The first step of which is to find the nearest line to each of the points. Then I will compare some attributes from the points dataset with the lines dataset, and confirm or reject a join, like I mentioned. tree_idx: tree_idx is created using the following code: import rtree lines_bounds = lines['geometry'].apply(lambda x: x.bounds) tree_idx = rtree.index.Index() for i in lines_bounds.index: tree_idx.insert(i, lines_bounds.loc[i])

Aucun commentaire:

Enregistrer un commentaire