dssg_banner

Exploring the Land Sealing Application Dataset

This project executes a pipeline which yields a complex dataset regarding land sealing applications across Germany. This notebook is intended to help users understand how to extract the final forms of these datasets and apply analysis.


Getting Started

Setup Colab environment

If you installed the packages and requirments on your own machine, you can skip this section and start from the import section. Otherwise you can follow and execute the tutorial on your browser. In order to start working on the notebook, click on the following button, this will open this page in the Colab environment and you will be able to execute the code on your own.

Open In Colab

Now that you are visualizing the notebook in Colab, uncomment and run the next cells to install the packages we will use and set up your collab environment. There are few things you should follow in order to properly set the notebook up: 1. Warning: This notebook was not authored by Google. Click on ‘Run anyway’. 2. When the installation commands are done, there might be “Restart runtime” button at the end of the output. Please, click it.

# %pip install pandas
# %pip install geopandas

By running the next cell you are going to create a folder in your Google Drive. All the files for this tutorial will be uploaded to this folder. After the first execution you might receive some warning and notification, please follow these instructions: 1. Permit this notebook to access your Google Drive files? Click on ‘Yes’, and select your account. 2. Google Drive for desktop wants to access your Google Account. Click on ‘Allow’.

At this point, a folder has been created and you can navigate it through the lefthand panel in Colab, you might also have received an email that informs you about the access on your Google Drive.

# Create a folder in your Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# Don't run this cell if you already cloned the repo
# !git clone https://github.com/DSSGxMunich/land-sealing-dataset-and-analysis.git
# %cd land-sealing-dataset-and-analysis

Imports

import pandas as pd ## a common datatable manipulation library
import geopandas as gpd ## an extension of pandas fo the geojson filetype
import os ## used for loading and saving files
import sys ## Used for handling imports
import matplotlib.pyplot as plt ## visualizations

sys.path.insert(0, './src') ## helps with imports in our file structure
from explorers.explorer import * ## functions that yield datasets
from data_pipeline.rplan_content_extraction.rplan_keyword_search import rplan_exact_keyword_search
from visualizations.rplan_visualization import plot_keyword_search_results

Introducing the Dataset Tables

The explorer module offers a seamless solution for accessing pre-processed and cleaned datasets designed for in-depth analysis. These dataset tables are the result of a comprehensive pipeline which are documented in the official documentation [insert link to documentation].

This process produces a range of data tables that encompass information on land parcels, local land use, regional plans, as well as crucial keywords and context extracted from each dataset table. [link to Hugging Face dataset documentation].

Within the explorer notebook,these dataset tables provide a straightforward entry point for your analytical endeavors. Whether you’re a seasoned data scientist or just starting out, the explorer module equips you with the essential tools to dive right into your analysis with ease.

Every row represents a unique land parcel, with its associated document text stored in the document_text table. This dataset also includes information specific to these land parcels, including their boundaries’ geometry, name, and ID. It’s important to note that the data we’ve collected is limited to the state of North Rhine-Westphalia (NRW).

lp = land_parcels()
lp.to_excel('data/nrw/clean/land_parcels.xlsx')
print("This table contains", len(lp), "land parcels")
lp.head(5)
This table contains 55836 land parcels
Land Parcel ID planid levelplan name kommune gkz nr besch aend aendert ... begruendurl umweltberurl erklaerungurl shape_Length shape_Area Region Plan ID regional_plan_name ART LND geometry
0 84060 DE_05382060_Siegburg_BP93/1 infra-local Im Klausgarten, Braschosser Straße, Am Kreuztor Siegburg 05382060 93/1 NaN NaN NaN ... NaN NaN NaN 868.647801 3.196032e+04 5022 Region Bonn/Rhein-Sieg Teilabschnitt 5 POLYGON ((7.28543 50.82280, 7.28728 50.82179, ...
1 559438 DE_05382036_02_32 infra-local 32. Änderung des Bebauungsplanes Nr. 2 „Much-K... Much 05382036 0 NaN 32. Änderung NaN ... https://www.much.de/zukunft/bauleitplanungen https://www.much.de/zukunft/bauleitplanungen NaN 473.229327 4.467916e+03 5022 Region Bonn/Rhein-Sieg Teilabschnitt 5 POLYGON ((7.39385 50.90281, 7.39416 50.90240, ...
2 2257588 DE_05314000_00 local Flächennutzungsplan der Bundesstadt Bonn Bonn 05314000 00 ... NaN NaN NaN 69372.039264 1.410146e+08 5022 Region Bonn/Rhein-Sieg Teilabschnitt 5 POLYGON ((7.12896 50.77292, 7.12899 50.77292, ...
3 2367967 DE_05378028_9aenderungI_Ur local 9. Änderung §34_Urschrift Rösrath 05378028 9aenderungI_Ur Breide und Durbusch Urschrift ... http://www.roesrath.de/34-9.-aenderung-breide-... 739.659941 7.348491e+03 5022 Region Bonn/Rhein-Sieg Teilabschnitt 5 MULTIPOLYGON (((7.23255 50.91855, 7.23242 50.9...
4 2367975 DE_05378028_1aenderungundUrschriftI_Ur local 1. Änderung und Urschrift §34_Urschrift Rösrath 05378028 1aenderungundUrschriftI_Ur Urschrift ... http://www.roesrath.de/34-urfassung-und-1.-aen... 56630.267941 6.082747e+06 5022 Region Bonn/Rhein-Sieg Teilabschnitt 5 MULTIPOLYGON (((7.19091 50.88535, 7.19112 50.8...

5 rows × 30 columns

This is a very large file containing the text blobs of every document extracted from the land parcel dataset’s scanurl column. Because it is so large we retrieve on a small amount of it here. The full data can be found in data\document_texts

## this data table is very large with many text blobs, so for
## efficiency we retrieve on the first 5 rows.
dt = document_texts()
## dataset found in data\document_texts

dt.to_excel('data/nrw/clean/document_texts.xlsx')
print("This table contains", len(dt), "document texts")
dt.head()
This table contains 23130 document texts
filename content metadata document_id Land Parcel ID land_parcel_name land_parcel_scanurl Document Type Code
0 116995_0.pdf \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n... {'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:... 1169950 116995 Lohhäuser Feld im Stadtteil Klein-Erkenschwick https://www.o-sp.de/download/oer-erkenschwick/... 1000.0
1 116995_10.pdf None {'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:... 11699510 116995 Lohhäuser Feld im Stadtteil Klein-Erkenschwick https://www.o-sp.de/download/oer-erkenschwick/... 1000.0
2 116995_2.pdf None {'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:... 1169952 116995 Lohhäuser Feld im Stadtteil Klein-Erkenschwick https://www.o-sp.de/download/oer-erkenschwick/... 1000.0
3 116995_4.pdf \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n... {'pdf:unmappedUnicodeCharsPerPage': ['0', '0',... 1169954 116995 Lohhäuser Feld im Stadtteil Klein-Erkenschwick https://www.o-sp.de/download/oer-erkenschwick/... 1000.0
4 116995_6.pdf None {'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:... 1169956 116995 Lohhäuser Feld im Stadtteil Klein-Erkenschwick https://www.o-sp.de/download/oer-erkenschwick/... 1000.0

We manually compiled a set of regional plan documents for NRW and extracted the text and relevant data from these files. Each row is a regional plan. One region may have multiple regional plan documents.

rp = regional_plans()
rp.to_excel('data/nrw/clean/regional_plans.xlsx')
rp.head(5)
filename chapter section section_type year Region Plan ID Name
0 arnsberg-2001-bochum_hagen freiraumentwicklung \nenergieversorgung\n2.2.4\n2.3\n2.4\n3.1\n3.2... start 2001.0 5055.0 Oberbereiche Bochum/Hagen
1 arnsberg-2001-bochum_hagen freiraum erläuterungen\nbegründet und näher beschrieben... explanation 2001.0 5055.0 Oberbereiche Bochum/Hagen
2 arnsberg-2001-bochum_hagen freiraumfunktionen ziel 1\n(1) freiraum darf nach den vorgaben de... target 2001.0 5055.0 Oberbereiche Bochum/Hagen
3 arnsberg-2001-bochum_hagen regionales ordnungskonzept und stärkung der in... erläuterung:\nder im gebietsentwicklungsplan d... explanation 2001.0 5055.0 Oberbereiche Bochum/Hagen
4 arnsberg-2001-bochum_hagen freiraum ziel 2\n(1) die siedlungsstruktur ist vorrangi... target 2001.0 5055.0 Oberbereiche Bochum/Hagen

We support two methods of keyword searching: * Exact search: in which we determine a True/False result if a given word is present or not. * Fuzzy search: In which we return the result’s before/after wording and match on the word given some amount of spelling differences (ie: compound words or mis-readings)

Exact keyword search results

kw_exact = exact_keyword(
    "data/nrw/bplan/features/keywords/exact_search/baunvo_keywords.csv"
    )
kw_exact.to_excel('data/nrw/clean/exact_keyword.xlsx')
kw_exact.head(5)
filename baunvo-1 baunvo-2 baunvo-3 baunvo-4 baunvo-4a baunvo-5 baunvo-5a baunvo-6 baunvo-6a ... baunvo-14 baunvo-15 baunvo-16 baunvo-17 baunvo-18 baunvo-19 baunvo-20 baunvo-21 baunvo-21a 13b
0 116995_0.pdf False False False False False False False False False ... False False False False False False False False False False
1 116995_10.pdf False False False False False False False False False ... False False False False False False False False False False
2 116995_2.pdf False False False False False False False False False ... False False False False False False False False False False
3 116995_4.pdf False False False False False False False False False ... False False False False False True False False False False
4 116995_6.pdf False False False False False False False False False ... False False False False False False False False False False

5 rows × 28 columns

histogram_data = kw_exact.iloc[:, 1:].sum()
histogram_data.plot(kind='bar', title='Histogram of "True" Occurrences', color='lightcoral')
<Axes: title={'center': 'Histogram of "True" Occurrences'}>
../../_images/explorer_20_1.png

Fuzzy keyword search results

kw_fuzzy = fuzzy_keyword()
kw_fuzzy.to_excel('data/nrw/clean/fuzzy_keyword.xlsx')
kw_fuzzy.sample(5,random_state=42)
Document ID firsthöhe geschossflächenzahl grundflächenzahl höhe baulicher anlagen maximale gebäudehöhe minimale gebäudehöhe traufhöhe
8323 2405906_4.pdf NaN NaN 14 absatz 1 baunvo unzulässig. 4.2.2 außerhalb... NaN NaN NaN NaN
1346 2408169_3.pdf 18 baunvo) bezugspunkte zur bestimmung der höh... NaN NaN NaN NaN NaN NaN
11270 2370506_0.pdf NaN NaN NaN trauflänge gerechnet. die vorgenannten bauteil... NaN NaN NaN
431 2395589_0.pdf dachaufbauten grundsätzlich nur in horizontale... NaN NaN NaN NaN NaN NaN
9666 2412877_0.pdf NaN NaN des räumlichen geltungsbereiches der 5. bebauu... NaN NaN NaN NaN
data_counts = kw_fuzzy.iloc[:, 1:].notna().sum()
data_counts.plot(kind='bar', title='Histogram of Categorical Occurrences', color='lightcoral')
<Axes: title={'center': 'Histogram of Categorical Occurrences'}>
../../_images/explorer_23_1.png

Knowledge agent

For a subsample of fuzzy keyword results, we run the text through GPT Turbo 3.5 to ask it to extract certain details. This table is the result of that experiment.

ka = knowledge_agent()
ka.to_excel('data/nrw/clean/knowledge_agent.xlsx')
ka.head(5)
id th_input th_agent_response th_extracted_value validation
0 id traufhöhe None NaN True
1 1956227.pdf / fh 9,5 m bei gebäuden mit zwei vollgeschosse... th: 6,5 m 6.5 True
2 1956230.pdf anderes material zulässig. 2 dächer zulässig s... None NaN True
3 2112722.pdf zulhsige grundflächenzahl (grz) (gern.§ 16 (2)... None NaN True
4 2112808.pdf der außenflächen der außenwand mit der dachhau... th: 5,00 m 5.0 True
th_extracted_counts = ka['th_extracted_value'].dropna().value_counts()
th_extracted_counts_sorted = th_extracted_counts.sort_index()
ax = th_extracted_counts_sorted.plot(kind='bar', title='Histogram of th_extracted_value', color='lightcoral')
ax.set_xlabel('th_extracted_value')
ax.set_ylabel('Frequency')
plt.yticks(range(int(th_extracted_counts_sorted.max()) + 1))
([<matplotlib.axis.YTick at 0x2b79ac3d390>,
  <matplotlib.axis.YTick at 0x2b79cf1c190>,
  <matplotlib.axis.YTick at 0x2b7a0b385d0>],
 [Text(0, 0, '0'), Text(0, 1, '1'), Text(0, 2, '2')])
../../_images/explorer_26_1.png

Missing Data Analysis at the Land Parcel Level

One of the analyses we conduct involves assessing the presence of B-plans and other pertinent documents in the land parcel entries on the NRW Geoportal. Specifically, we examine whether these documents are available as PDF files and adhere to the expected structure outlined on the website.

Below, we plot a map of NRW, color-coded to indicate the outcomes of our web scraping efforts. Successful extractions are depicted in one color, indicating that PDF files of building plans were obtained from the web portal. Conversely, failures are represented in another color, signifying instances where no PDF file was found or where the provided link diverged from the standardized format, leading to a different portal.

document_texts_df = document_texts(usecols=['land_parcel_id'])
land_parcels_gdf=land_parcels()
land_parcels_gdf['DownloadStatus_missing']=land_parcels_gdf[
    'Land Parcel ID'
    ].apply(
        lambda x: x in document_texts_df['Land Parcel ID'].values
        )
print("This table contains", len(land_parcels_gdf), "land parcels")
This table contains 55836 land parcels
missing_true_gdf = land_parcels_gdf[
    land_parcels_gdf['DownloadStatus_missing']
                                    == True]
missing_false_gdf = land_parcels_gdf[
    land_parcels_gdf['DownloadStatus_missing'] ==
                                     False]
fig, ax = plt.subplots(figsize=(12, 8))

missing_true_gdf.plot(ax=ax, color='lightcoral',
                      label='Download Unsuccessful')
missing_false_gdf.plot(ax=ax, color='lightgreen',
                       label='Download Successful')

legend_labels = ['Download Unsuccessful', 'Download Successful']
legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='lightcoral',
                             markersize=10,
                             markeredgecolor='darkred'),
                  plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='lightgreen',
                             markersize=10,
                             markeredgecolor='darkgreen')]

ax.legend(legend_handles, legend_labels, loc='lower right')

ax.set_xticks([])
ax.set_yticks([])
ax.set_title(
    'Analysis of Download Status of Building Plans on Land Parcels'
    )
plt.show()
../../_images/explorer_31_0.png

Flooding Risk Analysis at the Land Parcel Level

Keywords in Question: 1. hq100 2. hqhäufig 3. hqextrem 4. vorranggebiete 5. vorbehaltsgebiete 6. betroffen_von_überschwemmung

hochwasser_df = pd.read_json("data/nrw/bplan/features/keywords/fuzzy_search/keyword_dict_hochwasser.json")
desired_categories = ['hq100', 'hqhäufig', 'hqextrem',
                      'vorranggebiete', 'vorbehaltsgebiete',
                      'betroffen_von_überschwemmung']
hochwasser_df=hochwasser_df[
    hochwasser_df['category'].isin(desired_categories)
    ]
hochwasser_df.head(5)
contextualised_keyword actual_keyword category filename
0 zum teil in diesen hinein. erdgeschossfußböden... hq100 hq100 1956227.pdf
1 nis im mai 2001 wurde die mauer auf beiden sei... hq100 hq100 2369661_1.pdf
2 (hq 100) wahrscheinlichkeiten einer überschwem... hq100 hq100 2373299_2.pdf
3 potenzielle gefahr eines hochwasserereignisses... hq100 hq100 2376399.pdf
4 hqextrem: entspricht bis auf weitere überschwe... hq100 hq100 2379698_0.pdf
document_texts_df = document_texts()
combined_df = document_texts_df.merge(hochwasser_df, on='filename')
combined_df
joined_df = land_parcels_gdf.merge(combined_df, on='Land Parcel ID',
                                   how='outer')
hochwasser_merged_df=joined_df
for category in desired_categories:
    column_name = f'{category}_presence'
    hochwasser_merged_df[column_name] = hochwasser_merged_df[
        'category'].str.contains(category, case=False, na=False)

presence_columns = ['hq100_presence', 'hqhäufig_presence',
                    'hqextrem_presence',
                    'vorranggebiete_presence',
                    'vorbehaltsgebiete_presence',
                    'betroffen_von_überschwemmung_presence']

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Presence of Categories on Land Parcel Level',
             fontsize=16)

for i, column_name in enumerate(presence_columns):
    row, col = i // 3, i % 3
    ax = axes[row, col]

    hochwasser_merged_df[hochwasser_merged_df[column_name] ==
                         False].plot(ax=ax, color='orange',
                                     markersize=10, label='Not Present'
                                     )
    hochwasser_merged_df[hochwasser_merged_df[column_name] ==
                         True].plot(ax=ax, color='green',
                                    markersize=10, label='Present')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f'Presence of "{column_name}"')


legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='green', markersize=10),
                  plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='orange', markersize=10)]
legend_labels = ['Present', 'Not Present']

for ax in axes.flat:
    ax.legend(legend_handles, legend_labels, loc='lower right')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
../../_images/explorer_36_0.png

Extract content from regional plans

To extract content from regional plans, i.e. parse the text from the pdfs and divide them into chapters / sections. This yields the regional_plan_sections dataset. Next, we can visualise keyword occurences.

regional_plan_sections = pd.read_json("data/nrw/rplan/features/regional_plan_sections.json")
exact_result, exact_keywords = rplan_exact_keyword_search(input_df=regional_plan_sections)

plot_keyword_search_results(result_df=exact_result,
                            keyword_columns=exact_keywords,
                            title="Exact Keyword Search Results"
                            "for Regional Plans")
../../_images/explorer_38_0.png