Data Pipeline

This page documents the data_pipeline, which downloads, extracts, and transforms the building plans (BP) and regional plans (RP). It defines all methods needed to run the pipeline.

Download NRW building plan PDFs

The methods used here are demonstrated in the land_parcels_demo.ipynb notebook.

parse_geojson(file_path, output_path, sample_n=None) → DataFrame[source]

Parse geojson file from file_path and write it to output_path.

If sample_n is not None, the geojson is sampled down to sample_n rows. The function parse_non_downloadable_links is called to parse the links from the scanurl column. It adds to the dataframe all sub-links that were listed in the original dataframe and start with https://www.o-sp.de/download/ or https://gisdata.krzn.de/files/bplan. The objectid is extended with the index of the link.

Parameters:

file_path – path to geojson file
output_path – path to output file
sample_n – number of rows to sample

Returns:

dataframe with all links and sub-links

Return type:

pd.DataFrame
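
The sub-link filtering and objectid extension described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the helper name expand_sub_links, the plain-dict row layout with a sub_links list, and the underscore-separated index suffix are assumptions. Only the two URL prefixes are taken from the description.

```python
# Prefixes named in the description of parse_geojson.
ALLOWED_PREFIXES = (
    "https://www.o-sp.de/download/",
    "https://gisdata.krzn.de/files/bplan",
)


def expand_sub_links(rows):
    """Return one output row per downloadable sub-link.

    `rows` is a list of dicts with the keys "objectid" and "sub_links"
    (a hypothetical layout chosen for this sketch).
    """
    expanded = []
    for row in rows:
        # Keep only links that start with one of the allowed prefixes.
        kept = [link for link in row["sub_links"] if link.startswith(ALLOWED_PREFIXES)]
        for index, link in enumerate(kept):
            # Extend the objectid with the index of the kept link
            # (the "_" separator is an assumption).
            expanded.append({"objectid": f"{row['objectid']}_{index}", "scanurl": link})
    return expanded


rows = [
    {
        "objectid": "42",
        "sub_links": [
            "https://www.o-sp.de/download/plan_a.pdf",
            "https://example.org/viewer?id=1",
            "https://gisdata.krzn.de/files/bplan/plan_b.pdf",
        ],
    }
]
expanded = expand_sub_links(rows)
```

In this sketch the viewer link is dropped and the two remaining links receive the objectids 42_0 and 42_1.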

run_pdf_downloader(input_df: DataFrame, output_folder='../../data/NRW/pdfs', sample_n: int = None)[source]

This function takes as input a dataframe with the links to the PDFs and downloads them to the output folder.

Parameters:

input_df (pd.DataFrame) – DataFrame that contains the links to the PDFs, with the columns “scanurl” and “objectid”
output_folder (str) – Path to the folder where the PDFs will be saved
sample_n (int) – Number of rows to sample from the input_df. If None, all rows are used.
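
A download loop of this kind might look like the sketch below. It is an illustration under stated assumptions, not the implementation of run_pdf_downloader: the names download_pdfs and fetch_bytes, the <objectid>.pdf file naming, the "first sample_n rows" sampling, and the skip-on-error behaviour are all hypothetical. Rows are plain dicts here instead of a DataFrame so the example needs only the standard library.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download a URL and return the response body as bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_pdfs(rows, output_folder, sample_n=None, fetch=fetch_bytes):
    """Save each row's "scanurl" as <objectid>.pdf inside output_folder.

    `rows` is a list of dicts with the keys "scanurl" and "objectid".
    File naming and sampling are assumptions of this sketch.
    """
    folder = Path(output_folder)
    folder.mkdir(parents=True, exist_ok=True)
    selected = rows if sample_n is None else rows[:sample_n]
    saved = []
    for row in selected:
        target = folder / f"{row['objectid']}.pdf"
        try:
            target.write_bytes(fetch(row["scanurl"]))
        except OSError:
            # Skip links that cannot be fetched or written (URLError is an OSError).
            continue
        saved.append(target)
    return saved


# Demo with a fake fetch function, so no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    demo_rows = [
        {"scanurl": "https://www.o-sp.de/download/a.pdf", "objectid": "a"},
        {"scanurl": "https://www.o-sp.de/download/b.pdf", "objectid": "b"},
    ]
    saved = download_pdfs(demo_rows, tmp, sample_n=1, fetch=lambda url: b"%PDF-1.4 dummy")
    saved_names = [path.name for path in saved]
```

Passing the fetch function as a parameter keeps the loop testable without touching the network.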

merge_rp_bp(path_bp_geo='../data/NRW/NRW_BP.geojson', path_rp_geo='../data/regional_plans/regions_map.geojson') → GeoDataFrame[source]

Merge the BP and RP geojson files into one.

Parameters:

path_bp_geo – path to the BP geojson file
path_rp_geo – path to the RP geojson file

Returns:

GeoDataFrame with the overlapped BP and RP geojson files

Return type:

gpd.GeoDataFrame

export_merged_bp_rp(output_path, path_bp_geo='../data/NRW/NRW_BP.geojson', path_rp_geo='../data/regional_plans/regions_map.geojson')[source]

Merge the BP and RP geojson files and export the result to output_path. Wrapper function for merge_rp_bp.

Parameters:

output_path – path to the output file
path_bp_geo – path to the BP geojson file
path_rp_geo – path to the RP geojson file

Returns:

GeoDataFrame with the overlapped BP and RP geojson files

Return type:

gpd.GeoDataFrame

Extract Text from PDFs

pdf_parser_from_folder(folder_path: str, sample_size: int = None) → DataFrame[source]

Apply the pdf_parser_from_path function to every PDF in a folder.

Parameters:

folder_path – full input folder path as string

Returns:

df containing filename, content and metadata per pdf

Return type:

pd.DataFrame
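
Applying a single-file parser to a whole folder can be sketched as below. This is a hedged illustration only: parse_folder and fake_parser are hypothetical names, the stand-in parser does not read real PDFs, and taking the first sample_size files is an assumption about how sampling might work. Records are returned as a list of dicts, which could be turned into a DataFrame afterwards.

```python
import tempfile
from pathlib import Path


def parse_folder(folder_path, parse_one, sample_size=None):
    """Apply parse_one to every PDF in folder_path, one record per file."""
    pdf_paths = sorted(Path(folder_path).glob("*.pdf"))
    if sample_size is not None:
        pdf_paths = pdf_paths[:sample_size]  # assumed: keep the first N files
    records = []
    for pdf_path in pdf_paths:
        parsed = parse_one(str(pdf_path))
        # Merge the filename with the parser's content and metadata.
        records.append({"filename": pdf_path.name, **parsed})
    return records


def fake_parser(pdf_path):
    """Stand-in for pdf_parser_from_path; returns content and metadata keys."""
    return {"content": "", "metadata": {"path": pdf_path}}


with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    records = parse_folder(tmp, fake_parser)
    filenames = [record["filename"] for record in records]
```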

pdf_parser_from_path(pdf_path: str) → dict[source]

Parse pdf and extract content and metadata

Parameters:

pdf_path – path as string

Returns:

dictionary containing content and metadata

Return type:

dict

Parsing the regional plans

The RPlan Converter is used to extract the content from the regional plan PDFs.

class RPlanContentExtractor(rplan_config)[source]

Bases: object

extract_chapter_names(txt, cfg, margin: float = 0.1, toc_end_index: int = None)[source]

Extracts the chapter names from the textfile.

This method extracts the chapter names from the textfile. The chapter names are usually listed at the beginning of the textfile, therefore the margin. The chapter names are used to assign each section to a chapter.

Parameters:

txt – the rplan content as string
cfg – the rplan config as dictionary, for keys of the dict see init method
margin – the margin as float, the chapter names are extracted from the first margin% of the textfile. Not used if toc_end_index is specified
toc_end_index – the index where the table of contents ends, if None, the margin is used

Returns:

list of chapter names as strings

Return type:

list
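
The interplay of margin and toc_end_index can be illustrated with the following sketch. It is not the class's actual method: the function takes a regular expression directly instead of the cfg dictionary, and both the function name and CHAPTER_PATTERN are hypothetical.

```python
import re


def extract_chapter_names_sketch(txt, pattern, margin=0.1, toc_end_index=None):
    """Collect chapter names from the table of contents at the start of txt."""
    if toc_end_index is None:
        # Fall back to the first `margin` fraction of the text.
        toc_end_index = int(len(txt) * margin)
    toc = txt[:toc_end_index]
    return [match.group(1).strip() for match in re.finditer(pattern, toc)]


# Hypothetical pattern: a chapter number, then a title that ends at a semicolon.
CHAPTER_PATTERN = r"\d+\.\s+([a-z ]+);"

txt = "1. settlement; 2. open space; " + "body text " * 50
names_by_margin = extract_chapter_names_sketch(txt, CHAPTER_PATTERN, margin=0.1)
names_by_index = extract_chapter_names_sketch(txt, CHAPTER_PATTERN, toc_end_index=15)
```

With the default margin both chapter titles fall inside the searched prefix, while toc_end_index=15 cuts the search off after the first title.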

find_chapter_name_for_indices(indices, chapter_names, txt)[source]

Finds the chapter name for each index.
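
One plausible way to perform such a lookup is to assign each index to the closest chapter heading that precedes it in the text. The sketch below shows that idea; the matching rule and the function name are assumptions, since the documentation does not state how the method resolves chapters.

```python
def find_chapter_for_indices_sketch(indices, chapter_names, txt):
    """Map each section start index to the closest chapter name before it."""
    # Locate every occurrence of every chapter name in the text.
    positions = []
    for name in chapter_names:
        start = txt.find(name)
        while start != -1:
            positions.append((start, name))
            start = txt.find(name, start + 1)
    positions.sort()

    result = []
    for index in indices:
        chapter = None  # stays None if no chapter name precedes the index
        for position, name in positions:
            if position > index:
                break
            chapter = name
        result.append(chapter)
    return result


txt = (
    "toc: settlement, open space. "
    "settlement z1 build densely. "
    "open space g1 protect forests."
)
indices = [txt.index("z1"), txt.index("g1")]
chapters = find_chapter_for_indices_sketch(indices, ["settlement", "open space"], txt)
```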

parse_into_sections(txt: str, cfg: dict, chapter_names: list) → DataFrame[source]

Parses the rplan content into sections.

This method parses the rplan content into sections. The sections are the targets, principles and explanations. The indices of the sections are found by the markers, which are defined in the rplan_config.yml file. The chapters are used to assign each section to a chapter.

Parameters:

txt – the rplan content as string
cfg – the rplan config as dictionary, for keys of the dict see init method
chapter_names – list of chapter names as strings

Returns:

dataframe with columns chapter and section

Return type:

pd.DataFrame
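
Marker-based splitting of this kind can be sketched as follows. The marker patterns in MARKERS are invented for the example (the real ones live in rplan_config.yml and are not shown in this documentation), and the function returns a list of dicts instead of a DataFrame to stay dependency-free.

```python
import re

# Hypothetical markers; the real ones are defined per plan in rplan_config.yml.
MARKERS = {
    "target": r"\bz\d+\b",
    "principle": r"\bg\d+\b",
    "explanation": r"\bexplanation\b",
}


def split_into_sections(txt, markers):
    """Cut txt at every marker match; each section runs up to the next marker."""
    starts = []
    for section_type, pattern in markers.items():
        for match in re.finditer(pattern, txt):
            starts.append((match.start(), section_type))
    starts.sort()

    sections = []
    for i, (start, section_type) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else len(txt)
        sections.append(
            {"type": section_type, "start": start, "section": txt[start:end].strip()}
        )
    return sections


txt = "z1 build densely. explanation: saves land. g1 protect forests."
sections = split_into_sections(txt, MARKERS)
section_types = [section["type"] for section in sections]
```

The start index kept with each section is what a chapter lookup would use to assign the section to a chapter.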

parse_rplan_from_textfile(txt_path: str) → DataFrame[source]

Parses an rplan textfile into a dataframe with columns chapter and section.

This method extracts the chapters, targets, principles and explanations from an rplan textfile. The extraction is based on the rplan_config.yml file, which provides regular expressions for each rplan and each specific task. Before extraction the text is preprocessed, e.g. lowercased and stripped of newlines. The chapters are extracted from the first 10% of the textfile, as they are usually listed at the beginning, and are then used to assign each section to a chapter.

Parameters:

txt_path – path to the rplan textfile

Returns:

dataframe with columns filename, chaptername and section

Return type:

pd.DataFrame

preprocess_rplan_content(content: str)[source]

Preprocesses the rplan content

This method preprocesses the rplan content by removing all whitespace, newlines and special characters.

Parameters:

content – the rplan content as string

Returns:

the preprocessed rplan content as string

Return type:

str
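
A text normalisation step in this spirit could look like the sketch below. It reflects one interpretation only: the text is lowercased, characters outside a small allowed set are dropped, and runs of whitespace (including newlines) are collapsed to single spaces so that words stay separated. The function name and the exact character set are assumptions.

```python
import re


def preprocess_sketch(content):
    """Lowercase the text, drop special characters and collapse whitespace."""
    content = content.lower()
    # Replace anything that is not a word character, whitespace or basic
    # punctuation with a space (the allowed set is an assumption).
    content = re.sub(r"[^\w\s.,;:()\-]", " ", content)
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", content).strip()


cleaned = preprocess_sketch("Ziel 1:\n\tSiedlungsraum  *  sichern!")
```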

read_text(txt_path)[source]

The RPlan Converter utilizes the following methods:

parse_rplan_directory(txt_dir_path: str, json_output_path: str = None)[source]

Parses a directory with rplan textfiles into a dataframe with columns chapter and section

This method extracts the chapters, targets, principles and explanations from each rplan textfile in the directory. The extraction is based on the rplan_config.yml file, which provides regular expressions for each rplan and each specific task. Before extraction each text is preprocessed, e.g. lowercased and stripped of newlines. The chapters are extracted from the first 10% of the textfile, as they are usually listed at the beginning, and are then used to assign each section to a chapter. The dataframe is then saved to a json file.

Parameters:

txt_dir_path – path to the directory with the rplan textfiles
json_output_path – path to the output json file

Returns:

dataframe with columns filename, chapter and section

Return type:

pd.DataFrame
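
The directory-level flow (parse every textfile, collect the rows, write them to json) can be sketched like this. It is an illustration under assumptions: parse_directory_sketch and fake_parse_file are hypothetical names, the stand-in parser simply treats each line as a section, and writing the json file only when json_output_path is given is a guess based on its default of None.

```python
import json
import tempfile
from pathlib import Path


def parse_directory_sketch(txt_dir_path, parse_file, json_output_path=None):
    """Run parse_file on every .txt file and optionally dump all rows to json."""
    rows = []
    for txt_path in sorted(Path(txt_dir_path).glob("*.txt")):
        for row in parse_file(txt_path):
            rows.append({"filename": txt_path.name, **row})
    if json_output_path is not None:
        with open(json_output_path, "w", encoding="utf-8") as handle:
            json.dump(rows, handle, ensure_ascii=False, indent=2)
    return rows


def fake_parse_file(txt_path):
    """Stand-in for parse_rplan_from_textfile: one section per non-empty line."""
    lines = txt_path.read_text(encoding="utf-8").splitlines()
    return [{"chapter": "unknown", "section": line} for line in lines if line]


with tempfile.TemporaryDirectory() as tmp:
    plan_path = Path(tmp) / "plan_a.txt"
    plan_path.write_text("z1 build densely\ng1 protect forests\n", encoding="utf-8")
    out_path = Path(tmp) / "sections.json"
    rows = parse_directory_sketch(tmp, fake_parse_file, out_path)
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))
```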

parse_result_df(df)[source]

Parses the result df from the rplan extractor.

Adds the year and the region to the df.

Parameters:

df – pd.DataFrame with columns [‘filename’,…]

Returns:

pd.DataFrame with columns [‘filename’, …, ‘year’]
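
Deriving year and region from the filename might work along the lines of the sketch below. The filename pattern <region>_<year>.txt is purely an assumption for illustration, as the real naming scheme is not given here, and rows are plain dicts instead of a DataFrame.

```python
import re


def add_year_and_region(rows):
    """Add "year" and "region" to each row, derived from its filename.

    Assumes filenames shaped like "<region>_<year>.txt" (hypothetical).
    """
    enriched = []
    for row in rows:
        match = re.search(r"^(?P<region>.+?)_(?P<year>\d{4})\.", row["filename"])
        year = int(match.group("year")) if match else None
        region = match.group("region") if match else None
        enriched.append({**row, "year": year, "region": region})
    return enriched


enriched = add_year_and_region(
    [{"filename": "koeln_2018.txt"}, {"filename": "readme.txt"}]
)
```

Rows whose filename does not fit the assumed pattern keep None for both fields instead of raising an error.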