Data Pipeline
This is the documentation for data_pipeline, which downloads, extracts, and transforms the building plans and regional plans. It lists every function and method needed to run the pipeline.
Download NRW building plan PDFs
The methods used here are demonstrated in the land_parcels_demo.ipynb notebook.
- parse_geojson(file_path, output_path, sample_n=None) DataFrame[source]
Parse the GeoJSON file at file_path and write the result to output_path.
If sample_n is not None, the GeoJSON is sampled down to sample_n rows. The function parse_non_downloadable_links is then called to parse the links in the scanurl column. It adds to the dataframe every sub-link that was listed in the original dataframe and starts with https://www.o-sp.de/download/ or https://gisdata.krzn.de/files/bplan. The objectid of each added link is extended with the index of the link.
- Parameters:
file_path – path to geojson file
output_path – path to output file
sample_n – number of rows to sample
- Returns:
dataframe with all links and sub-links
- Return type:
pd.DataFrame
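The sub-link expansion described for parse_geojson can be sketched in plain Python. This is a minimal illustration, not the pipeline's implementation: the helper name expand_links, the assumption that several links are packed into one scanurl value separated by whitespace, and the "<objectid>_<index>" format are all hypothetical. Only the two URL prefixes and the column names come from the description above.

```python
# The two accepted prefixes are taken from the parse_geojson description.
ALLOWED_PREFIXES = (
    "https://www.o-sp.de/download/",
    "https://gisdata.krzn.de/files/bplan",
)


def expand_links(rows):
    """Return one record per accepted sub-link (hypothetical helper).

    `rows` is an iterable of dicts with the keys "objectid" and "scanurl".
    Assumption: sub-links inside one scanurl value are whitespace-separated.
    """
    expanded = []
    for row in rows:
        links = [
            link
            for link in str(row["scanurl"]).split()
            if link.startswith(ALLOWED_PREFIXES)
        ]
        for index, link in enumerate(links):
            # Assumed format: the link index is appended to the objectid.
            expanded.append(
                {"objectid": f"{row['objectid']}_{index}", "scanurl": link}
            )
    return expanded


rows = [
    {
        "objectid": 17,
        "scanurl": "https://www.o-sp.de/download/a.pdf https://example.org/b.pdf",
    },
]
result = expand_links(rows)
```

In this sketch the link on example.org is dropped because it matches neither prefix, so result holds a single record for objectid 17.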
- run_pdf_downloader(input_df: DataFrame, output_folder='../../data/NRW/pdfs', sample_n: int = None)[source]
This function takes as input a dataframe with the links to the PDFs and downloads them to the output folder.
- Parameters:
input_df (pd.DataFrame) – DataFrame that contains the links to the PDFs, with the columns “scanurl” and “objectid”
output_folder (str) – Path to the folder where the PDFs will be saved
sample_n (int) – Number of rows to sample from the input_df. If None, all rows are used.
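A download loop of the kind run_pdf_downloader describes might look like the sketch below. It is not the actual implementation: the names fetch_bytes and download_pdfs, the "<objectid>.pdf" file naming, and the policy of skipping failed downloads are assumptions made for illustration. Only the scanurl and objectid columns and the output folder come from the description above.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download the raw bytes behind `url` with a plain HTTP GET."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_pdfs(records, output_folder, fetch=fetch_bytes):
    """Save one PDF per record and return the objectids that failed.

    `records` is an iterable of dicts with the keys "scanurl" and "objectid".
    Assumptions: files are named <objectid>.pdf, and a failed download is
    skipped rather than aborting the whole run.
    """
    target_dir = Path(output_folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for record in records:
        target = target_dir / f"{record['objectid']}.pdf"
        try:
            target.write_bytes(fetch(record["scanurl"]))
        except OSError:  # urllib's URLError is a subclass of OSError
            failed.append(record["objectid"])
    return failed
```

Passing the fetch function as a parameter keeps the loop testable without network access, since a stub can stand in for the real download.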
- merge_rp_bp(path_bp_geo='../data/NRW/NRW_BP.geojson', path_rp_geo='../data/regional_plans/regions_map.geojson') GeoDataFrame[source]
Merge the BP (building plan) and RP (regional plan) GeoJSON files into one GeoDataFrame.
- Parameters:
path_bp_geo – path to the BP geojson file
path_rp_geo – path to the RP geojson file
- Returns:
GeoDataFrame with the overlapped BP and RP geojson files
- Return type:
gpd.GeoDataFrame
- export_merged_bp_rp(output_path, path_bp_geo='../data/NRW/NRW_BP.geojson', path_rp_geo='../data/regional_plans/regions_map.geojson')[source]
Export the merged BP and RP GeoJSON data to a single file at output_path. Wrapper function for merge_rp_bp.
- Parameters:
output_path – path to the output file
path_bp_geo – path to the BP geojson file
path_rp_geo – path to the RP geojson file
- Returns:
GeoDataFrame with the overlapped BP and RP geojson files
- Return type:
gpd.GeoDataFrame
Extract Text from PDFs
Parsing the regional plans
The RPlan Converter is used to extract the content from the regional plan PDFs.
- class RPlanContentExtractor(rplan_config)[source]
Bases: object
- extract_chapter_names(txt, cfg, margin: float = 0.1, toc_end_index: int = None)[source]
Extracts the chapter names from the textfile.
The chapter names are usually listed at the beginning of the textfile, so only its leading part, controlled by margin, is searched. The chapter names are used to assign each section to a chapter.
- Parameters:
txt – the rplan content as string
cfg – the rplan config as dictionary, for keys of the dict see init method
margin – the margin as float, given as a fraction: the chapter names are extracted from the leading margin share of the textfile (the default 0.1 means the first 10%). Not used if toc_end_index is specified
toc_end_index – the index where the table of contents ends, if None, the margin is used
- Returns:
list of chapter names as strings
- Return type:
chapter_names
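The interplay of margin and toc_end_index can be sketched as follows. This is an illustration under stated assumptions, not the class's code: the helper names toc_window and chapter_names_from and the example pattern are hypothetical, and the real regular expressions come from the rplan config.

```python
import re


def toc_window(txt, margin=0.1, toc_end_index=None):
    """Return the leading slice of `txt` that is searched for chapter names."""
    if toc_end_index is not None:
        # An explicit end index takes precedence over the margin.
        return txt[:toc_end_index]
    return txt[: int(len(txt) * margin)]


def chapter_names_from(txt, chapter_pattern, margin=0.1, toc_end_index=None):
    """Collect every match of `chapter_pattern` inside the search window."""
    window = toc_window(txt, margin, toc_end_index)
    return [m.group(0).strip() for m in re.finditer(chapter_pattern, window)]


# Toy input: a short table of contents followed by filler text.
text = "1 einleitung 2 siedlungsraum " + "x" * 300
pattern = r"\d+ [a-z]+"  # hypothetical pattern, for illustration only
names = chapter_names_from(text, pattern)
```

With the default margin of 0.1 the window covers the first 32 characters of this toy text, which is enough to capture both entries. Passing toc_end_index=13 instead restricts the search to the first entry.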
- find_chapter_name_for_indices(indices, chapter_names, txt)[source]
Finds the chapter name for each index.
- parse_into_sections(txt: str, cfg: dict, chapter_names: list) DataFrame[source]
Parses the rplan content into sections.
The sections are the targets, principles and explanations. The indices of the sections are found with the markers defined in the rplan_config.yml file. The chapter names are used to assign each section to a chapter.
- Parameters:
txt – the rplan content as string
cfg – the rplan config as dictionary, for keys of the dict see init method
chapter_names – list of chapter names as strings
- Returns:
dataframe with columns chapter and section
- Return type:
result_df
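One way to picture the marker-based splitting and the chapter assignment is the sketch below. It is a guess at the general technique, not the method's actual logic: the function name split_into_sections, the use of the last occurrence of each chapter name as the chapter start, and the example marker pattern are all assumptions. It returns plain dicts instead of a dataframe to stay dependency-free.

```python
import bisect
import re


def split_into_sections(txt, marker_pattern, chapter_names):
    """Split `txt` at every marker match and tag each section with a chapter."""
    # Assumption: the last occurrence of a chapter name is its heading in the
    # body; earlier occurrences belong to the table of contents.
    starts = sorted((txt.rfind(name), name) for name in chapter_names if name in txt)
    positions = [pos for pos, _ in starts]
    marker_hits = [m.start() for m in re.finditer(marker_pattern, txt)]
    # A section ends at the next marker, the next chapter heading, or the end.
    boundaries = sorted(set(marker_hits + positions + [len(txt)]))
    rows = []
    for begin in marker_hits:
        end = boundaries[bisect.bisect_right(boundaries, begin)]
        slot = bisect.bisect_right(positions, begin) - 1
        chapter = starts[slot][1] if slot >= 0 else None
        rows.append({"chapter": chapter, "section": txt[begin:end].strip()})
    return rows


text = (
    "1 einleitung 2 freiraum "  # table of contents
    "1 einleitung ziel: a grundsatz: b 2 freiraum ziel: c"
)
sections = split_into_sections(
    text, r"(ziel|grundsatz):", ["1 einleitung", "2 freiraum"]
)
```

Here the first two sections fall under "1 einleitung" and the third under "2 freiraum", because each section is assigned to the nearest chapter heading that precedes it.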
- parse_rplan_from_textfile(txt_path: str) DataFrame[source]
Parses a rplan textfile into a dataframe with columns chapter and section
This method extracts the chapters, targets, principles and explanations from an rplan textfile. The extraction is based on the rplan_config.yml file, which gives regular expressions for each rplan and each specific task. The textfile is preprocessed before the extraction, e.g. lowercased and stripped of newlines. The chapters are extracted from the first 10% of the textfile, as the chapters are usually listed at the beginning. The chapters are then used to assign each section to a chapter.
- Parameters:
txt_path – path to the rplan textfile
- Returns:
dataframe with columns filename, chaptername and section
- Return type:
sections_df
- preprocess_rplan_content(content: str)[source]
Preprocesses the rplan content
This method preprocesses the rplan content by removing all whitespace, newlines and special characters.
- Parameters:
content – the rplan content as string
- Returns:
the preprocessed rplan content as string
- Return type:
content
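A preprocessing step of this kind could look like the sketch below. The exact rules are assumptions: the function name preprocess is hypothetical, the set of characters treated as "special" is a guess, and this version collapses surplus whitespace to single spaces so that the words stay separable. The real method may behave differently.

```python
import re


def preprocess(content):
    """Lowercase the text, flatten line breaks, and drop unusual characters."""
    content = content.lower()
    # Replace line breaks with spaces so that sentences become continuous.
    content = re.sub(r"[\r\n]+", " ", content)
    # Keep word characters, whitespace and basic punctuation.
    # Assumption: everything else counts as a "special character".
    content = re.sub(r"[^\w\s.,:;()\-]", "", content)
    # Collapse runs of whitespace into a single space.
    content = re.sub(r"\s+", " ", content)
    return content.strip()


cleaned = preprocess("Ziel 1:\nDer  Freiraum ist *zu* sichern.\n")
```

Lowercasing before the extraction lets the regular expressions in the config be written in lower case only.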
It utilizes the following methods:
- parse_rplan_directory(txt_dir_path: str, json_output_path: str = None)[source]
Parses a directory with rplan textfiles into a dataframe with columns chapter and section
This function extracts the chapters, targets, principles and explanations from every rplan textfile in the directory. The extraction is based on the rplan_config.yml file, which gives regular expressions for each rplan and each specific task. Each textfile is preprocessed before the extraction, e.g. lowercased and stripped of newlines. The chapters are extracted from the first 10% of the textfile, as the chapters are usually listed at the beginning. The chapters are then used to assign each section to a chapter. The dataframe is then saved to a JSON file.
- Parameters:
txt_dir_path – path to the directory with the rplan textfiles
json_output_path – path to the output json file
- Returns:
dataframe with columns filename, chapter and section
- Return type:
sections_df
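The directory-level loop can be sketched with the standard library alone. This is an illustration, not the function's code: parse_directory and its parse_file callback are hypothetical names, the restriction to *.txt files is an assumption, and the rows are plain dicts rather than a dataframe. Writing the JSON file only when json_output_path is given is also an assumption, based on the parameter's default of None.

```python
import json
from pathlib import Path


def parse_directory(txt_dir_path, parse_file, json_output_path=None):
    """Apply `parse_file` to every .txt file and collect the resulting rows.

    `parse_file` takes a path and returns a list of dicts with the keys
    "chapter" and "section" (a stand-in for the per-file parser).
    """
    rows = []
    for txt_path in sorted(Path(txt_dir_path).glob("*.txt")):
        for row in parse_file(txt_path):
            # Tag each section with the file it came from.
            rows.append({"filename": txt_path.name, **row})
    if json_output_path is not None:
        Path(json_output_path).write_text(
            json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8"
        )
    return rows
```

Sorting the file paths makes the row order reproducible across runs, and ensure_ascii=False keeps umlauts readable in the JSON output.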