Exploring the Land Sealing Application Dataset

This project executes a pipeline which yields a complex dataset regarding land sealing applications across Germany. This notebook is intended to help users understand how to extract the final forms of these datasets and apply analysis.

Getting Started

Setup Colab environment

If you installed the packages and requirments on your own machine, you can skip this section and start from the import section. Otherwise you can follow and execute the tutorial on your browser. In order to start working on the notebook, click on the following button, this will open this page in the Colab environment and you will be able to execute the code on your own.

Now that you are visualizing the notebook in Colab, uncomment and run the next cells to install the packages we will use and set up your collab environment. There are few things you should follow in order to properly set the notebook up: 1. Warning: This notebook was not authored by Google. Click on ‘Run anyway’. 2. When the installation commands are done, there might be “Restart runtime” button at the end of the output. Please, click it.

# %pip install pandas
# %pip install geopandas

By running the next cell you are going to create a folder in your Google Drive. All the files for this tutorial will be uploaded to this folder. After the first execution you might receive some warning and notification, please follow these instructions: 1. Permit this notebook to access your Google Drive files? Click on ‘Yes’, and select your account. 2. Google Drive for desktop wants to access your Google Account. Click on ‘Allow’.

At this point, a folder has been created and you can navigate it through the lefthand panel in Colab, you might also have received an email that informs you about the access on your Google Drive.

# Create a folder in your Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# Don't run this cell if you already cloned the repo
# !git clone https://github.com/DSSGxMunich/land-sealing-dataset-and-analysis.git

# %cd land-sealing-dataset-and-analysis

Imports

import pandas as pd ## a common datatable manipulation library
import geopandas as gpd ## an extension of pandas fo the geojson filetype
import os ## used for loading and saving files
import sys ## Used for handling imports
import matplotlib.pyplot as plt ## visualizations

sys.path.insert(0, './src') ## helps with imports in our file structure
from explorers.explorer import * ## functions that yield datasets
from data_pipeline.rplan_content_extraction.rplan_keyword_search import rplan_exact_keyword_search
from visualizations.rplan_visualization import plot_keyword_search_results

Introducing the Dataset Tables

The explorer module offers a seamless solution for accessing pre-processed and cleaned datasets designed for in-depth analysis. These dataset tables are the result of a comprehensive pipeline which are documented in the official documentation [insert link to documentation].

This process produces a range of data tables that encompass information on land parcels, local land use, regional plans, as well as crucial keywords and context extracted from each dataset table. [link to Hugging Face dataset documentation].

Within the explorer notebook,these dataset tables provide a straightforward entry point for your analytical endeavors. Whether you’re a seasoned data scientist or just starting out, the explorer module equips you with the essential tools to dive right into your analysis with ease.

Every row represents a unique land parcel, with its associated document text stored in the document_text table. This dataset also includes information specific to these land parcels, including their boundaries’ geometry, name, and ID. It’s important to note that the data we’ve collected is limited to the state of North Rhine-Westphalia (NRW).

lp = land_parcels()
lp.to_excel('data/nrw/clean/land_parcels.xlsx')
print("This table contains", len(lp), "land parcels")
lp.head(5)

This table contains 55836 land parcels

	Land Parcel ID	planid	levelplan	name	kommune	gkz	nr	besch	aend	aendert	...	begruendurl	umweltberurl	erklaerungurl	shape_Length	shape_Area	Region Plan ID	regional_plan_name	ART	LND	geometry
0	84060	DE_05382060_Siegburg_BP93/1	infra-local	Im Klausgarten, Braschosser Straße, Am Kreuztor	Siegburg	05382060	93/1	NaN	NaN	NaN	...	NaN	NaN	NaN	868.647801	3.196032e+04	5022	Region Bonn/Rhein-Sieg	Teilabschnitt	5	POLYGON ((7.28543 50.82280, 7.28728 50.82179, ...
1	559438	DE_05382036_02_32	infra-local	32. Änderung des Bebauungsplanes Nr. 2 „Much-K...	Much	05382036	0	NaN	32. Änderung	NaN	...	https://www.much.de/zukunft/bauleitplanungen	https://www.much.de/zukunft/bauleitplanungen	NaN	473.229327	4.467916e+03	5022	Region Bonn/Rhein-Sieg	Teilabschnitt	5	POLYGON ((7.39385 50.90281, 7.39416 50.90240, ...
2	2257588	DE_05314000_00	local	Flächennutzungsplan der Bundesstadt Bonn	Bonn	05314000	00				...	NaN	NaN	NaN	69372.039264	1.410146e+08	5022	Region Bonn/Rhein-Sieg	Teilabschnitt	5	POLYGON ((7.12896 50.77292, 7.12899 50.77292, ...
3	2367967	DE_05378028_9aenderungI_Ur	local	9. Änderung §34_Urschrift	Rösrath	05378028	9aenderungI_Ur	Breide und Durbusch	Urschrift		...	http://www.roesrath.de/34-9.-aenderung-breide-...			739.659941	7.348491e+03	5022	Region Bonn/Rhein-Sieg	Teilabschnitt	5	MULTIPOLYGON (((7.23255 50.91855, 7.23242 50.9...
4	2367975	DE_05378028_1aenderungundUrschriftI_Ur	local	1. Änderung und Urschrift §34_Urschrift	Rösrath	05378028	1aenderungundUrschriftI_Ur		Urschrift		...	http://www.roesrath.de/34-urfassung-und-1.-aen...			56630.267941	6.082747e+06	5022	Region Bonn/Rhein-Sieg	Teilabschnitt	5	MULTIPOLYGON (((7.19091 50.88535, 7.19112 50.8...

5 rows × 30 columns

This is a very large file containing the text blobs of every document extracted from the land parcel dataset’s scanurl column. Because it is so large we retrieve on a small amount of it here. The full data can be found in data\document_texts

## this data table is very large with many text blobs, so for
## efficiency we retrieve on the first 5 rows.
dt = document_texts()
## dataset found in data\document_texts

dt.to_excel('data/nrw/clean/document_texts.xlsx')
print("This table contains", len(dt), "document texts")
dt.head()

This table contains 23130 document texts

	filename	content	metadata	document_id	Land Parcel ID	land_parcel_name	land_parcel_scanurl	Document Type Code
0	116995_0.pdf	\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...	{'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:...	1169950	116995	Lohhäuser Feld im Stadtteil Klein-Erkenschwick	https://www.o-sp.de/download/oer-erkenschwick/...	1000.0
1	116995_10.pdf	None	{'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:...	11699510	116995	Lohhäuser Feld im Stadtteil Klein-Erkenschwick	https://www.o-sp.de/download/oer-erkenschwick/...	1000.0
2	116995_2.pdf	None	{'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:...	1169952	116995	Lohhäuser Feld im Stadtteil Klein-Erkenschwick	https://www.o-sp.de/download/oer-erkenschwick/...	1000.0
3	116995_4.pdf	\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...	{'pdf:unmappedUnicodeCharsPerPage': ['0', '0',...	1169954	116995	Lohhäuser Feld im Stadtteil Klein-Erkenschwick	https://www.o-sp.de/download/oer-erkenschwick/...	1000.0
4	116995_6.pdf	None	{'pdf:unmappedUnicodeCharsPerPage': '0', 'pdf:...	1169956	116995	Lohhäuser Feld im Stadtteil Klein-Erkenschwick	https://www.o-sp.de/download/oer-erkenschwick/...	1000.0

We manually compiled a set of regional plan documents for NRW and extracted the text and relevant data from these files. Each row is a regional plan. One region may have multiple regional plan documents.

rp = regional_plans()
rp.to_excel('data/nrw/clean/regional_plans.xlsx')
rp.head(5)

	filename	chapter	section	section_type	year	Region Plan ID	Name
0	arnsberg-2001-bochum_hagen	freiraumentwicklung	\nenergieversorgung\n2.2.4\n2.3\n2.4\n3.1\n3.2...	start	2001.0	5055.0	Oberbereiche Bochum/Hagen
1	arnsberg-2001-bochum_hagen	freiraum	erläuterungen\nbegründet und näher beschrieben...	explanation	2001.0	5055.0	Oberbereiche Bochum/Hagen
2	arnsberg-2001-bochum_hagen	freiraumfunktionen	ziel 1\n(1) freiraum darf nach den vorgaben de...	target	2001.0	5055.0	Oberbereiche Bochum/Hagen
3	arnsberg-2001-bochum_hagen	regionales ordnungskonzept und stärkung der in...	erläuterung:\nder im gebietsentwicklungsplan d...	explanation	2001.0	5055.0	Oberbereiche Bochum/Hagen
4	arnsberg-2001-bochum_hagen	freiraum	ziel 2\n(1) die siedlungsstruktur ist vorrangi...	target	2001.0	5055.0	Oberbereiche Bochum/Hagen

We support two methods of keyword searching: * Exact search: in which we determine a True/False result if a given word is present or not. * Fuzzy search: In which we return the result’s before/after wording and match on the word given some amount of spelling differences (ie: compound words or mis-readings)

Exact keyword search results

kw_exact = exact_keyword(
    "data/nrw/bplan/features/keywords/exact_search/baunvo_keywords.csv"
    )
kw_exact.to_excel('data/nrw/clean/exact_keyword.xlsx')
kw_exact.head(5)

	filename	baunvo-1	baunvo-2	baunvo-3	baunvo-4	baunvo-4a	baunvo-5	baunvo-5a	baunvo-6	baunvo-6a	...	baunvo-14	baunvo-15	baunvo-16	baunvo-17	baunvo-18	baunvo-19	baunvo-20	baunvo-21	baunvo-21a	13b
0	116995_0.pdf	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	116995_10.pdf	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	116995_2.pdf	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	116995_4.pdf	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	True	False	False	False	False
4	116995_6.pdf	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

5 rows × 28 columns

histogram_data = kw_exact.iloc[:, 1:].sum()
histogram_data.plot(kind='bar', title='Histogram of "True" Occurrences', color='lightcoral')

<Axes: title={'center': 'Histogram of "True" Occurrences'}>

Fuzzy keyword search results

kw_fuzzy = fuzzy_keyword()
kw_fuzzy.to_excel('data/nrw/clean/fuzzy_keyword.xlsx')
kw_fuzzy.sample(5,random_state=42)

	Document ID	firsthöhe	geschossflächenzahl	grundflächenzahl	höhe baulicher anlagen	maximale gebäudehöhe	minimale gebäudehöhe	traufhöhe
8323	2405906_4.pdf	NaN	NaN	14 absatz 1 baunvo unzulässig. 4.2.2 außerhalb...	NaN	NaN	NaN	NaN
1346	2408169_3.pdf	18 baunvo) bezugspunkte zur bestimmung der höh...	NaN	NaN	NaN	NaN	NaN	NaN
11270	2370506_0.pdf	NaN	NaN	NaN	trauflänge gerechnet. die vorgenannten bauteil...	NaN	NaN	NaN
431	2395589_0.pdf	dachaufbauten grundsätzlich nur in horizontale...	NaN	NaN	NaN	NaN	NaN	NaN
9666	2412877_0.pdf	NaN	NaN	des räumlichen geltungsbereiches der 5. bebauu...	NaN	NaN	NaN	NaN

data_counts = kw_fuzzy.iloc[:, 1:].notna().sum()
data_counts.plot(kind='bar', title='Histogram of Categorical Occurrences', color='lightcoral')

<Axes: title={'center': 'Histogram of Categorical Occurrences'}>

Knowledge agent

For a subsample of fuzzy keyword results, we run the text through GPT Turbo 3.5 to ask it to extract certain details. This table is the result of that experiment.

ka = knowledge_agent()
ka.to_excel('data/nrw/clean/knowledge_agent.xlsx')
ka.head(5)

	id	th_input	th_agent_response	th_extracted_value	validation
0	id	traufhöhe	None	NaN	True
1	1956227.pdf	/ fh 9,5 m bei gebäuden mit zwei vollgeschosse...	th: 6,5 m	6.5	True
2	1956230.pdf	anderes material zulässig. 2 dächer zulässig s...	None	NaN	True
3	2112722.pdf	zulhsige grundflächenzahl (grz) (gern.§ 16 (2)...	None	NaN	True
4	2112808.pdf	der außenflächen der außenwand mit der dachhau...	th: 5,00 m	5.0	True

th_extracted_counts = ka['th_extracted_value'].dropna().value_counts()
th_extracted_counts_sorted = th_extracted_counts.sort_index()
ax = th_extracted_counts_sorted.plot(kind='bar', title='Histogram of th_extracted_value', color='lightcoral')
ax.set_xlabel('th_extracted_value')
ax.set_ylabel('Frequency')
plt.yticks(range(int(th_extracted_counts_sorted.max()) + 1))

([<matplotlib.axis.YTick at 0x2b79ac3d390>,
  <matplotlib.axis.YTick at 0x2b79cf1c190>,
  <matplotlib.axis.YTick at 0x2b7a0b385d0>],
 [Text(0, 0, '0'), Text(0, 1, '1'), Text(0, 2, '2')])

Missing Data Analysis at the Land Parcel Level

One of the analyses we conduct involves assessing the presence of B-plans and other pertinent documents in the land parcel entries on the NRW Geoportal. Specifically, we examine whether these documents are available as PDF files and adhere to the expected structure outlined on the website.

Below, we plot a map of NRW, color-coded to indicate the outcomes of our web scraping efforts. Successful extractions are depicted in one color, indicating that PDF files of building plans were obtained from the web portal. Conversely, failures are represented in another color, signifying instances where no PDF file was found or where the provided link diverged from the standardized format, leading to a different portal.

document_texts_df = document_texts(usecols=['land_parcel_id'])
land_parcels_gdf=land_parcels()

land_parcels_gdf['DownloadStatus_missing']=land_parcels_gdf[
    'Land Parcel ID'
    ].apply(
        lambda x: x in document_texts_df['Land Parcel ID'].values
        )
print("This table contains", len(land_parcels_gdf), "land parcels")

This table contains 55836 land parcels

missing_true_gdf = land_parcels_gdf[
    land_parcels_gdf['DownloadStatus_missing']
                                    == True]
missing_false_gdf = land_parcels_gdf[
    land_parcels_gdf['DownloadStatus_missing'] ==
                                     False]
fig, ax = plt.subplots(figsize=(12, 8))

missing_true_gdf.plot(ax=ax, color='lightcoral',
                      label='Download Unsuccessful')
missing_false_gdf.plot(ax=ax, color='lightgreen',
                       label='Download Successful')

legend_labels = ['Download Unsuccessful', 'Download Successful']
legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='lightcoral',
                             markersize=10,
                             markeredgecolor='darkred'),
                  plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='lightgreen',
                             markersize=10,
                             markeredgecolor='darkgreen')]

ax.legend(legend_handles, legend_labels, loc='lower right')

ax.set_xticks([])
ax.set_yticks([])
ax.set_title(
    'Analysis of Download Status of Building Plans on Land Parcels'
    )
plt.show()

Flooding Risk Analysis at the Land Parcel Level

Keywords in Question: 1. hq100 2. hqhäufig 3. hqextrem 4. vorranggebiete 5. vorbehaltsgebiete 6. betroffen_von_überschwemmung

hochwasser_df = pd.read_json("data/nrw/bplan/features/keywords/fuzzy_search/keyword_dict_hochwasser.json")
desired_categories = ['hq100', 'hqhäufig', 'hqextrem',
                      'vorranggebiete', 'vorbehaltsgebiete',
                      'betroffen_von_überschwemmung']
hochwasser_df=hochwasser_df[
    hochwasser_df['category'].isin(desired_categories)
    ]

hochwasser_df.head(5)

	contextualised_keyword	actual_keyword	category	filename
0	zum teil in diesen hinein. erdgeschossfußböden...	hq100	hq100	1956227.pdf
1	nis im mai 2001 wurde die mauer auf beiden sei...	hq100	hq100	2369661_1.pdf
2	(hq 100) wahrscheinlichkeiten einer überschwem...	hq100	hq100	2373299_2.pdf
3	potenzielle gefahr eines hochwasserereignisses...	hq100	hq100	2376399.pdf
4	hqextrem: entspricht bis auf weitere überschwe...	hq100	hq100	2379698_0.pdf

document_texts_df = document_texts()
combined_df = document_texts_df.merge(hochwasser_df, on='filename')
combined_df
joined_df = land_parcels_gdf.merge(combined_df, on='Land Parcel ID',
                                   how='outer')
hochwasser_merged_df=joined_df

for category in desired_categories:
    column_name = f'{category}_presence'
    hochwasser_merged_df[column_name] = hochwasser_merged_df[
        'category'].str.contains(category, case=False, na=False)

presence_columns = ['hq100_presence', 'hqhäufig_presence',
                    'hqextrem_presence',
                    'vorranggebiete_presence',
                    'vorbehaltsgebiete_presence',
                    'betroffen_von_überschwemmung_presence']

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Presence of Categories on Land Parcel Level',
             fontsize=16)

for i, column_name in enumerate(presence_columns):
    row, col = i // 3, i % 3
    ax = axes[row, col]

    hochwasser_merged_df[hochwasser_merged_df[column_name] ==
                         False].plot(ax=ax, color='orange',
                                     markersize=10, label='Not Present'
                                     )
    hochwasser_merged_df[hochwasser_merged_df[column_name] ==
                         True].plot(ax=ax, color='green',
                                    markersize=10, label='Present')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f'Presence of "{column_name}"')


legend_handles = [plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='green', markersize=10),
                  plt.Line2D([0], [0], marker='o', color='w',
                             markerfacecolor='orange', markersize=10)]
legend_labels = ['Present', 'Not Present']

for ax in axes.flat:
    ax.legend(legend_handles, legend_labels, loc='lower right')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Extract content from regional plans

To extract content from regional plans, i.e. parse the text from the pdfs and divide them into chapters / sections. This yields the regional_plan_sections dataset. Next, we can visualise keyword occurences.

regional_plan_sections = pd.read_json("data/nrw/rplan/features/regional_plan_sections.json")
exact_result, exact_keywords = rplan_exact_keyword_search(input_df=regional_plan_sections)

plot_keyword_search_results(result_df=exact_result,
                            keyword_columns=exact_keywords,
                            title="Exact Keyword Search Results"
                            "for Regional Plans")