Data Pipeline

This is the documentation for the data_pipeline, which downloads, extracts, and transforms the building plans and regional plans. It contains the definitions for all methods necessary to run the pipeline.

Download NRW building plan PDFs

The methods used here are demonstrated in the land_parcels_demo.ipynb notebook.

parse_geojson(file_path, output_path, sample_n=None) DataFrame[source]

Parse geojson file from file_path and write it to output_path.

If sample_n is not None, the geojson is sampled to sample_n rows. The function parse_non_downloadable_links is called to parse the links from the scanurl column. It adds to the dataframe all sub-links that were listed in the original dataframe and start with https://www.o-sp.de/download/ or https://gisdata.krzn.de/files/bplan. The objectid of each added row is extended with the index of the link.

Parameters:
  • file_path – path to geojson file

  • output_path – path to output file

  • sample_n – number of rows to sample

Returns:

dataframe with all links and sub-links

Return type:

pd.DataFrame
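
The sub-link expansion described for parse_geojson can be sketched without any dependencies. This is only an illustration, not the pipeline's implementation: the helper name expand_sublinks, the assumption that scanurl holds whitespace-separated links, the objectid_index naming scheme, and the sample URLs are all hypothetical. Only the two URL prefixes come from the description above.

```python
# Hypothetical sketch of the sub-link expansion; not the pipeline's actual code.
ALLOWED_PREFIXES = (
    "https://www.o-sp.de/download/",
    "https://gisdata.krzn.de/files/bplan",
)


def expand_sublinks(rows):
    """Return one row per allowed link, with the link index appended to objectid.

    Assumes each row is a dict with "objectid" and a "scanurl" string that
    may contain several whitespace-separated links.
    """
    expanded = []
    for row in rows:
        links = [u for u in row["scanurl"].split() if u.startswith(ALLOWED_PREFIXES)]
        for i, link in enumerate(links):
            expanded.append({"objectid": f"{row['objectid']}_{i}", "scanurl": link})
    return expanded


# Illustrative rows; the file names are made up.
rows = [
    {"objectid": 17, "scanurl": "https://www.o-sp.de/download/a.pdf https://example.org/b.pdf"},
    {"objectid": 18, "scanurl": "https://gisdata.krzn.de/files/bplan/c.pdf"},
]
result = expand_sublinks(rows)
```

In this sketch the link to example.org is dropped because it matches neither prefix, and the two remaining rows receive the ids 17_0 and 18_0.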

run_pdf_downloader(input_df: DataFrame, output_folder='../../data/NRW/pdfs', sample_n: int = None)[source]

This function takes as input a dataframe with the links to the PDFs and downloads them to the output folder.

Parameters:
  • input_df (pd.DataFrame) – DataFrame that contains the links to the PDFs, with the columns “scanurl” and “objectid”

  • output_folder (str) – Path to the folder where the PDFs will be saved

  • sample_n (int) – Number of rows to sample from the input_df. If None, all rows are used.
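
A download loop of this kind might look roughly like the sketch below. It is an assumption-laden illustration rather than the code behind run_pdf_downloader: the names fetch_bytes and download_pdfs, the objectid.pdf file naming, and the skip-on-error behaviour are hypothetical. The fetcher is passed in as a parameter so the example can run offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_bytes(url):
    """Default fetcher: download the raw bytes behind a URL."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_pdfs(rows, output_folder, fetch=fetch_bytes):
    """Save each row's "scanurl" as <objectid>.pdf and return the written paths.

    The <objectid>.pdf file naming is an assumption made for this sketch.
    """
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in rows:
        target = out_dir / f"{row['objectid']}.pdf"
        try:
            target.write_bytes(fetch(row["scanurl"]))
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print(f"skipping {row['scanurl']}: {err}")
            continue
        written.append(target)
    return written


# Offline demonstration with a stand-in fetcher instead of a real download.
demo_rows = [{"objectid": "17_0", "scanurl": "https://www.o-sp.de/download/a.pdf"}]
with tempfile.TemporaryDirectory() as tmp:
    paths = download_pdfs(demo_rows, tmp, fetch=lambda url: b"%PDF-1.4 demo")
    saved_names = [p.name for p in paths]
    saved_bytes = paths[0].read_bytes()
```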

merge_rp_bp(path_bp_geo='../data/NRW/NRW_BP.geojson', path_rp_geo='../data/regional_plans/regions_map.geojson') GeoDataFrame[source]

Merge the BP and RP geojson files into one GeoDataFrame.

Parameters:
  • path_bp_geo – path to the BP geojson file

  • path_rp_geo – path to the RP geojson file

Returns:

GeoDataFrame with the overlapped BP and RP geojson files

Return type:

gpd.GeoDataFrame

export_merged_bp_rp(output_path, path_bp_geo='../data/NRW/NRW_BP.geojson', path_rp_geo='../data/regional_plans/regions_map.geojson')[source]

Export the merged BP and RP geojson files into one output file. Wrapper function for merge_rp_bp.

Parameters:
  • output_path – path to the output file

  • path_bp_geo – path to the BP geojson file

  • path_rp_geo – path to the RP geojson file

Returns:

GeoDataFrame with the overlapped BP and RP geojson files

Return type:

gpd.GeoDataFrame

Extract Text from PDFs

pdf_parser_from_folder(folder_path: str, sample_size: int = None) DataFrame[source]

Apply the pdf_parser_from_path function to every PDF in a folder.

Parameters:

folder_path – full input folder path as string

Returns:

df containing filename, content and metadata per pdf

Return type:

pd.DataFrame
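
The folder-level wrapper can be pictured as a loop over single-file parses. The sketch below is hypothetical: parse_folder and fake_parser are illustrative names, the stand-in parser replaces pdf_parser_from_path so no PDF library is needed, and treating sample_size as "parse at most this many randomly chosen files" is an assumption, since the parameter is not documented here.

```python
import random
import tempfile
from pathlib import Path


def parse_folder(folder_path, parse_one, sample_size=None, seed=0):
    """Apply parse_one to every PDF in folder_path and collect one record per file.

    parse_one stands in for pdf_parser_from_path and must return a dict.
    """
    pdf_paths = sorted(Path(folder_path).glob("*.pdf"))
    if sample_size is not None:
        rng = random.Random(seed)
        pdf_paths = rng.sample(pdf_paths, min(sample_size, len(pdf_paths)))
    records = []
    for path in pdf_paths:
        parsed = parse_one(str(path))
        records.append({"filename": path.name, **parsed})
    return records


def fake_parser(pdf_path):
    """Stand-in parser so the sketch runs without a PDF library."""
    return {"content": Path(pdf_path).read_text(), "metadata": {"pages": 1}}


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (Path(tmp) / name).write_text(f"text of {name}")
    records = parse_folder(tmp, fake_parser)
    sampled = parse_folder(tmp, fake_parser, sample_size=1)
```

A list of such records can be turned into a dataframe with one row per PDF, matching the filename, content and metadata columns named above.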

pdf_parser_from_path(pdf_path: str) dict[source]

Parse pdf and extract content and metadata

Parameters:

pdf_path – path as string

Returns:

dictionary containing content and metadata

Return type:

dict

Parsing the regional plans

The RPlan Converter is used to extract the content from the regional plan PDFs.

class RPlanContentExtractor(rplan_config)[source]

Bases: object

extract_chapter_names(txt, cfg, margin: float = 0.1, toc_end_index: int = None)[source]

Extracts the chapter names from the textfile.

The chapter names are usually listed at the beginning of the textfile, so only the first part of the text, set by margin, is searched. The chapter names are used to assign each section to a chapter.

Parameters:
  • txt – the rplan content as string

  • cfg – the rplan config as dictionary, for keys of the dict see init method

  • margin – fraction of the textfile, counted from the start, that is searched for chapter names (0.1 means the first 10%). Not used if toc_end_index is specified

  • toc_end_index – the index where the table of contents ends, if None, the margin is used

Returns:

list of chapter names as strings

Return type:

list
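
The interplay of margin and toc_end_index can be illustrated with a short sketch. It is not the method's actual implementation: the function names toc_slice and extract_chapter_names_sketch and the chapter pattern are hypothetical, and the real regular expressions live in the rplan config.

```python
import re


def toc_slice(txt, margin=0.1, toc_end_index=None):
    """Return the part of txt that is searched for chapter names."""
    if toc_end_index is not None:
        return txt[:toc_end_index]  # explicit end of the table of contents wins
    return txt[: int(len(txt) * margin)]


def extract_chapter_names_sketch(txt, chapter_pattern, margin=0.1, toc_end_index=None):
    """Find chapter names in the table-of-contents part of txt."""
    head = toc_slice(txt, margin, toc_end_index)
    return [m.group(0).strip() for m in re.finditer(chapter_pattern, head)]


# Hypothetical pattern: a chapter number followed by a single-word title.
pattern = r"\d+\.\s[a-zäöüß]+"
text = "1. einleitung 2. siedlungsraum 3. freiraum " + "x" * 400 + " 9. anhang"
names = extract_chapter_names_sketch(text, pattern)
names_short = extract_chapter_names_sketch(text, pattern, toc_end_index=14)
```

With the default margin only the first 10% of the text is searched, so the late match "9. anhang" is ignored. Passing toc_end_index=14 narrows the search to the first chapter entry.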

find_chapter_name_for_indices(indices, chapter_names, txt)[source]

Finds the chapter name for each index.
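
One plausible way to assign a chapter to a character index is to look up the closest chapter heading at or before that index. The sketch below assumes exactly that, plus the heuristic that the last occurrence of a chapter name is the heading in the body rather than the table-of-contents entry. Both are assumptions for illustration, and chapter_for_indices is a hypothetical name.

```python
from bisect import bisect_right


def chapter_for_indices(indices, chapter_names, txt):
    """Map each character index to the closest chapter heading at or before it."""
    # Use the last occurrence of each name, on the assumption that the first
    # occurrence sits in the table of contents and the last one in the body.
    starts = sorted(
        (txt.rfind(name), name) for name in chapter_names if txt.rfind(name) != -1
    )
    positions = [pos for pos, _ in starts]
    result = []
    for index in indices:
        k = bisect_right(positions, index)
        result.append(starts[k - 1][1] if k > 0 else None)
    return result


body = "1. einleitung 2. freiraum ... 1. einleitung ziel a 2. freiraum grundsatz b"
chapters = chapter_for_indices(
    [body.index("ziel a"), body.index("grundsatz b"), 0],
    ["1. einleitung", "2. freiraum"],
    body,
)
```

Indices that lie before the first body heading, such as 0 in this example, get None in this sketch.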

parse_into_sections(txt: str, cfg: dict, chapter_names: list) DataFrame[source]

Parses the rplan content into sections.

The sections are the targets, principles and explanations. The start index of each section is found by its marker, and the markers are defined in the rplan_config.yml file. The chapters are used to assign each section to a chapter.

Parameters:
  • txt – the rplan content as string

  • cfg – the rplan config as dictionary, for keys of the dict see init method

  • chapter_names – list of chapter names as strings

Returns:

dataframe with columns chapter and section

Return type:

pd.DataFrame
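
Marker-based splitting can be sketched as follows. The function name split_by_markers, the dictionary keys, and the three regular expressions are illustrative assumptions. The real markers are defined per rplan in rplan_config.yml and are not reproduced here.

```python
import re


def split_by_markers(txt, marker_patterns):
    """Cut txt into sections, each starting at a marker and ending at the next one.

    marker_patterns maps a section type to a regular expression; both the keys
    and the patterns used below are illustrative, not the real config.
    """
    hits = []
    for section_type, pattern in marker_patterns.items():
        hits.extend((m.start(), section_type) for m in re.finditer(pattern, txt))
    hits.sort()
    sections = []
    for i, (start, section_type) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(txt)
        sections.append(
            {"start": start, "type": section_type, "section": txt[start:end].strip()}
        )
    return sections


markers = {"target": r"ziel \d+", "principle": r"grundsatz \d+", "explanation": r"erläuterung"}
text = "ziel 1 flächen sichern grundsatz 2 freiraum schützen erläuterung der begriffe"
sections = split_by_markers(text, markers)
```

The start index kept for each section is what a chapter lookup would need in order to assign the section to a chapter.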

parse_rplan_from_textfile(txt_path: str) DataFrame[source]

Parses an rplan textfile into a dataframe with columns chapter and section.

This method extracts the chapters, targets, principles and explanations from an rplan textfile. The extraction is based on the rplan_config.yml file, which provides regular expressions for each rplan and each specific task. The textfile is preprocessed before the extraction, e.g. by lowercasing it and removing newlines. The chapters are extracted from the first 10% of the textfile, as the chapters are usually listed at the beginning. The chapters are then used to assign each section to a chapter.

Parameters:

txt_path – path to the rplan textfile

Returns:

dataframe with columns filename, chaptername and section

Return type:

pd.DataFrame

preprocess_rplan_content(content: str)[source]

Preprocesses the rplan content

This method preprocesses the rplan content by removing whitespace, newlines and special characters.

Parameters:

content – the rplan content as string

Returns:

the preprocessed rplan content as string

Return type:

str
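
A preprocessing step along these lines might look like the sketch below. The exact rules are assumptions: which characters count as special is not stated above, and this sketch collapses whitespace to single spaces instead of deleting it, so that words stay separable. The name preprocess_sketch is hypothetical.

```python
import re


def preprocess_sketch(content):
    """Lowercase the text, drop special characters and collapse whitespace."""
    content = content.lower()
    content = re.sub(r"[^\w\s.,;:()-]", " ", content)  # replace special characters
    content = re.sub(r"\s+", " ", content)  # newlines and runs of spaces -> one space
    return content.strip()


cleaned = preprocess_sketch("Ziel 1:\n  Freiraum\tschützen • §2\n")
```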

read_text(txt_path)[source]

The RPlan Converter utilizes the following methods:

parse_rplan_directory(txt_dir_path: str, json_output_path: str = None)[source]

Parses a directory with rplan textfiles into a dataframe with columns chapter and section.

This method extracts the chapters, targets, principles and explanations from each rplan textfile in the directory. The extraction is based on the rplan_config.yml file, which provides regular expressions for each rplan and each specific task. Each textfile is preprocessed before the extraction, e.g. by lowercasing it and removing newlines. The chapters are extracted from the first 10% of the textfile, as the chapters are usually listed at the beginning. The chapters are then used to assign each section to a chapter. The dataframe is then saved to a json file.

Parameters:
  • txt_dir_path – path to the directory with the rplan textfiles

  • json_output_path – path to the output json file

Returns:

dataframe with columns filename, chapter and section

Return type:

pd.DataFrame

parse_result_df(df)[source]

Parses the result df from the rplan extractor.

Adds the year and the region to the df.

Parameters:

df – pd.DataFrame with columns [‘filename’,…]

Returns:

pd.DataFrame with columns [‘filename’, …, ‘year’]
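
How the year and the region are derived is not specified above. The sketch below assumes, purely for illustration, that both are encoded in the filename as region_year.txt. The function name add_year_and_region and the sample filenames are hypothetical.

```python
import re


def add_year_and_region(rows):
    """Derive "year" and "region" from each row's filename.

    Assumes hypothetical filenames of the form <region>_<year>.txt.
    """
    enriched = []
    for row in rows:
        match = re.fullmatch(r"(?P<region>.+)_(?P<year>\d{4})\.txt", row["filename"])
        year = int(match.group("year")) if match else None
        region = match.group("region") if match else None
        enriched.append({**row, "region": region, "year": year})
    return enriched


rows = [{"filename": "koeln_2018.txt"}, {"filename": "unknown.txt"}]
enriched = add_year_and_region(rows)
```

Filenames that do not follow the assumed convention keep their row but get None for both new fields.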