Feature Extraction

This section documents the feature extraction API, which covers keyword extraction (fuzzy or exact) and document categorization.

Document Categorization

This function sorts documents into two categories: BP and non-BP. It uses the BP keywords and the BP document structure to detect BP documents.

run_bp_keyword_detector(text_file_path: str, original_files_path: str, text_column: str = 'content', id_column: str = 'filename', sample_n: int = None) → DataFrame[source]

Run the BP and keyword detector.

Checks if the text contains the BP pattern and if the keywords are found. Also adds the categorization from the original dataframe.

Parameters:

text_file_path (str) – path to the text file
original_files_path (str) – path to the original files
text_column (str) – name of the column with the text
id_column (str) – name of the column with the unique file identifier
sample_n (int) – number of rows to sample

Returns:

dataframe with columns filename, document_category and the unique document id.

Return type:

pd.DataFrame

Textual Feature Extraction

This section contains the API documentation for the textual feature extraction, such as the keyword extraction (fuzzy or exact). The fuzzy keyword extraction is based on the Levenshtein distance and the exact keyword extraction is based on regular expression or substring matching.

enrich_extracts_with_metadata(info_df: DataFrame, text_df: DataFrame)[source]

Function that joins BP-metadata and BP-text to produce the document_texts table.

Parameters:

info_df – df containing metadata
text_df – df containing extracted text

Returns:

merged df

Return type:

pd.DataFrame
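A join of this kind can be sketched with pandas. The join key (filename), the column names, and the choice of a left join are assumptions made for illustration; the documentation above does not state which key or join type the function actually uses.

```python
import pandas as pd

# Hypothetical metadata and extracted-text tables; all column names are assumptions.
info_df = pd.DataFrame(
    {"filename": ["a.pdf", "b.pdf"], "municipality": ["Musterstadt", "Beispieldorf"]}
)
text_df = pd.DataFrame(
    {"filename": ["a.pdf", "b.pdf"], "content": ["text of a", "text of b"]}
)

# Keep every extracted text and attach the metadata where the key matches.
document_texts = text_df.merge(info_df, on="filename", how="left")
print(document_texts.columns.tolist())
```

A left join keeps texts that have no metadata row; an inner join would silently drop them.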

Fuzzy Keyword Search

find_best_matches(id: str, content: str, keyword: str, threshold: int, context_words: int) → list[source]

Function that finds best matches for a given keyword / keyword combination in the input string.

This function uses the thefuzz package to find the best matches for a given keyword / keyword combination in the input string. It returns a list of dictionaries containing the id, keyword, matched phrase, and similarity score per match. The similarity score is calculated with the fuzz ratio, which leverages the Levenshtein distance to measure the similarity between two strings. The score is normalized between 0 and 100, with 100 being the most similar.

Parameters:

id – identifier of the content (e.g., filename)
content – textual input to be searched for keyword
keyword – one or multiple word input of interest
threshold – for similarity search between input and keyword
context_words – get surrounding context of -x and +x words

Returns:

list of dictionaries containing

id,
keyword,
matched_phrase, and
similarity score per match

Return type:

list
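The idea behind the fuzzy search can be sketched in plain Python. This is a minimal illustration, not the function's implementation: it slides a window of keyword-sized word groups over the text and scores each with a simple Levenshtein-based similarity, which is not the exact formula used by thefuzz. The names sketch_find_matches, levenshtein, similarity, and the "context" key are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Map the edit distance to a 0-100 score, 100 meaning identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def sketch_find_matches(doc_id, content, keyword, threshold, context_words):
    """Score every window of len(keyword.split()) words against the keyword."""
    words = content.split()
    window = len(keyword.split())
    matches = []
    for start in range(len(words) - window + 1):
        phrase = " ".join(words[start:start + window])
        score = similarity(phrase.lower(), keyword.lower())
        if score >= threshold:
            lo = max(0, start - context_words)
            hi = start + window + context_words
            matches.append(
                {
                    "id": doc_id,
                    "keyword": keyword,
                    "matched_phrase": phrase,
                    "similarity": score,
                    "context": " ".join(words[lo:hi]),  # assumed extra key
                }
            )
    return matches


# "Grundflachenzahl" is a deliberate misspelling (missing umlaut).
matches = sketch_find_matches(
    "a.pdf", "Die Grundflachenzahl beträgt 0,5", "Grundflächenzahl", 70, 1
)
print(matches)
```

A lower threshold tolerates more OCR noise and spelling variation but also admits more false positives, so it is worth checking a sample of matches when tuning it.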

search_df_for_best_matches(input_df: DataFrame, id_column_name: str, text_column_name: str, keyword: str, threshold: int = 70, context_words: int = 3) → DataFrame | None[source]

Function that searches df for best matches.

This function iterates through the input_df and uses the id and content columns to search for the given keyword. If multiple matches are found, they are aggregated into one cell (using ‘;;; ’ as separator).

Parameters:

input_df – expected to have 2 columns, i.e., ‘id’ and ‘content’ (exact naming may differ)
id_column_name – name of the identifying column (e.g., filename)
text_column_name – name of the column holding the relevant text
keyword – one or multiple word input of interest
threshold – for similarity search between input and keyword
context_words – get surrounding context of -x and +x words

Returns:

df holding the best matches per id. If multiple are found, they are aggregated into one cell (using ‘;;; ’ as separator).

Return type:

pd.DataFrame
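The aggregation of several matches into one cell can be sketched as follows. The per-match table and its column names are hypothetical; only the ‘;;; ’ separator is taken from the description above.

```python
import pandas as pd

SEPARATOR = ";;; "

# Hypothetical intermediate table with one row per fuzzy match.
per_match = pd.DataFrame(
    {
        "filename": ["a.pdf", "a.pdf", "b.pdf"],
        "matched_phrase": ["GRZ 0,4", "GRZ 0,5", "GRZ 0,3"],
    }
)

# Collapse all matches that share an id into a single cell.
aggregated = (
    per_match.groupby("filename")["matched_phrase"]
    .agg(SEPARATOR.join)
    .reset_index()
)

# The individual matches can be recovered by splitting on the separator.
first_cell = aggregated.loc[0, "matched_phrase"]
recovered = first_cell.split(SEPARATOR)
print(aggregated)
```

A multi-character separator such as this one is unlikely to occur in the matched text itself, which keeps the split unambiguous.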

search_best_matches_dict(input_df: DataFrame, id_column_name: str, text_column_name: str, keyword_dict: dict, threshold: int, context_words: int)[source]

Function that enables fuzzy search with keyword_dictionary input.

This function iterates through the input_df and uses the id and content columns to search for all occurrences of the keywords in the keyword_dict. If multiple matches are found, they are aggregated into one cell (using ‘;;; ’ as separator).

Parameters:

input_df – expected to have 2 columns, i.e., ‘id’ and ‘content’ (exact naming may differ)
id_column_name – name of the identifying column (e.g., filename)
text_column_name – name of the column holding the relevant text
keyword_dict – dictionary of relevant keywords
threshold – for similarity search between input and keyword
context_words – get surrounding context of -x and +x words

Returns:

df holding the best matches per id. If multiple are found, they are aggregated into one cell (using ‘;;; ’ as separator).

Return type:

pd.DataFrame

search_df_for_best_matches_keyword_dict(input_df: DataFrame, id_column_name: str, text_column_name: str, keyword_dict: dict, default_threshold: int = 70, context_words: int = 3, boolean_output: bool = True)[source]

Wrapper function to search for multiple keywords in a df. This function is a wrapper around search_df_for_best_matches() and search_best_matches_dict(). It enables fuzzy search with a keyword_dictionary input.

Parameters:

input_df – expected to have 2 columns, i.e., ‘id’ and ‘content’ (exact naming may differ)
id_column_name – name of the identifying column (e.g., filename)
text_column_name – name of the column holding the relevant text
keyword_dict – dict of keywords to be searched for
default_threshold – for similarity search between input and keyword
context_words – get surrounding context of -x and +x words
boolean_output – defaults to True; if True, df is returned with booleans instead of strings

Returns:

df holding the best matches per id. If multiple are found, they are aggregated into one cell (using ‘;;; ’ as separator).

Return type:

pd.DataFrame

Exact Keyword Search

search_text_for_keywords(text: str, keyword_dict: dict) → dict[source]

Function to find one-word or multiple-word keywords in input text.

This function searches a text for keywords. It takes a text and a dictionary of keywords as input and returns a dictionary with the keywords found in the text. It is case-insensitive and uses substring matching.

Parameters:

text – Input string to be searched for keywords
keyword_dict – Dictionary listing relevant keyword values per key

Returns:

Dictionary with found keywords per input

Return type:

dict
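Case-insensitive substring matching over a keyword dictionary can be sketched as below. The function name sketch_search_text, the example keys, and the exact shape of the returned dictionary (a list of found keywords per key) are assumptions for illustration.

```python
def sketch_search_text(text: str, keyword_dict: dict) -> dict:
    """Return, per key, the keywords that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return {
        key: [keyword for keyword in keywords if keyword.lower() in lowered]
        for key, keywords in keyword_dict.items()
    }


# Hypothetical keyword dictionary: one list of keyword values per key.
keyword_dict = {
    "density": ["Grundflächenzahl", "GRZ"],
    "height": ["Firsthöhe"],
}
result = sketch_search_text("Die Grundflächenzahl (GRZ) beträgt 0,5.", keyword_dict)
print(result)
```

Because this is substring matching, a short keyword also matches inside longer words, so very short keywords may need a word-boundary regular expression instead.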

search_df_for_keywords(input_df: DataFrame, id_column_name: str, text_column_name: str, keyword_dict: dict, boolean: bool = False) → DataFrame[source]

Function to process columns row by row, checking for all entries from keyword dict.

This function is used to search for keywords in a df. It takes as input a df and a dictionary with keywords and returns a df with the keywords found in the text. It leverages the search_text_for_keywords function.

Parameters:

input_df – Input df to be searched for keywords
id_column_name – Name of the identifying column (e.g., filename)
text_column_name – Name of the column in the input df holding the relevant text
keyword_dict – Dict of relevant keywords
boolean – defaults to False; if True, df is returned with booleans instead of strings

Returns:

Output df holds found keywords per key (column) and id (row)

Return type:

pd.DataFrame
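The row-by-row search and the effect of the boolean flag can be sketched with pandas. The helper names and the way found keywords are joined into a string are assumptions; the actual output format of the string variant is not specified above.

```python
import pandas as pd


def found_keywords(text, keywords):
    """Case-insensitive substring search for each keyword."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword.lower() in lowered]


def sketch_search_df(input_df, id_column_name, text_column_name, keyword_dict, boolean=False):
    """Build one output column per key of keyword_dict."""
    result = input_df[[id_column_name]].copy()
    for key, keywords in keyword_dict.items():
        hits = input_df[text_column_name].apply(
            lambda text, kws=keywords: found_keywords(text, kws)
        )
        # Either flag whether anything was found or list what was found.
        result[key] = hits.apply(bool) if boolean else hits.apply(", ".join)
    return result


input_df = pd.DataFrame(
    {"filename": ["a.pdf", "b.pdf"], "content": ["Die GRZ beträgt 0,5.", "Keine Angaben."]}
)
keyword_dict = {"density": ["Grundflächenzahl", "GRZ"]}

as_strings = sketch_search_df(input_df, "filename", "content", keyword_dict)
as_flags = sketch_search_df(input_df, "filename", "content", keyword_dict, boolean=True)
print(as_flags)
```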

Agent

This section contains the API documentation for the agent extraction. The agent extraction uses a large language model (GPT-3.5) to extract detailed information from the results of a fuzzy search. One example is the extraction of the “Grundflächenzahl” value from the sentence “Die Grundflächenzahl beträgt 0,5”.

Warning

The agent extraction has not been fully evaluated. Only simple backchecking was done, e.g. checking whether the resulting number actually appears in the sentence. Because the extraction relies on a large language model, the results are not always correct and should be checked manually before use.

extract_knowledge_from_df(keyword_dict: dict, input_df: DataFrame, id_column_name: str, text_column_name: str, model_name: str = 'gpt-3.5-turbo-0613') → DataFrame[source]

Function that extracts the relevant value from the text input, if present, and validates it.

This function is used to extract relevant information from a df of text snippets. It takes as input a df and a dictionary with keywords and returns a df with the extracted information and a validation column.

Parameters:

keyword_dict – dictionary containing keyword, keyword_short and template_name
input_df – df of text snippets to extract information from
id_column_name – name of the identifying column (e.g., filename)
text_column_name – name of the column holding the relevant text
model_name – name of the llm used (for OpenAI API Call)

Returns:

df containing the columns input_text, extracted_value, and validation

Return type:

pd.DataFrame
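A simple backcheck of the kind mentioned in the warning, verifying that an extracted number really occurs in the source sentence, might look like the sketch below. The function name and the handling of decimal separators are assumptions; how the validation column is actually computed is not documented here.

```python
import re


def normalise_number(value: str) -> str:
    """Use a dot as decimal separator so "0,5" and "0.5" compare equal."""
    return value.strip().replace(",", ".")


def value_appears_in_text(extracted_value: str, input_text: str) -> bool:
    """Return True if the extracted number occurs as a whole number in the text."""
    numbers_in_text = re.findall(r"\d+(?:[.,]\d+)?", input_text)
    normalised = {normalise_number(number) for number in numbers_in_text}
    return normalise_number(extracted_value) in normalised


sentence = "Die Grundflächenzahl beträgt 0,5"
print(value_appears_in_text("0.5", sentence))
print(value_appears_in_text("0,8", sentence))
```

A check like this catches values the model invented, but it cannot tell whether a number that does appear in the sentence belongs to the requested keyword.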

Regional Plan Keyword Search

This section contains the API documentation for the regional plan keyword extraction. The regional plan keyword extraction is based on the regional plan structure and the regional plan keywords. It splits the regional plan into sections and extracts the keywords from the sections.

rplan_exact_keyword_search(input_df: DataFrame, save_path: str = None, drop_false_rows=False)[source]

Function to search for keywords in a df.

This function uses exact matching to find the keywords. It takes the content extracted from the rplan pdfs as input. The keywords are stored in a json file. Internally, it relies on the search_df_for_keywords function from the contextual_exact_search module.

Parameters:

input_df – Input df to be searched for keywords
save_path – defaults to None; if None, the result is not saved
drop_false_rows – defaults to False; if True, rows with all False values are dropped

Returns:

Result df of the keyword search

Return type:

pd.DataFrame
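The effect of drop_false_rows can be illustrated with a small pandas sketch. The result table and its keyword columns are hypothetical; the sketch only shows how rows without any keyword hit could be filtered out.

```python
import pandas as pd

# Hypothetical boolean result of a keyword search, one column per keyword.
result_df = pd.DataFrame(
    {
        "filename": ["a.pdf", "b.pdf", "c.pdf"],
        "wind": [True, False, False],
        "housing": [False, False, True],
    }
)
keyword_columns = ["wind", "housing"]

# Keep only rows where at least one keyword column is True.
kept = result_df[result_df[keyword_columns].any(axis=1)]
print(kept["filename"].tolist())
```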

rplan_fuzzy_keyword_search(input_df: DataFrame, save_path: str = None, drop_false_rows=False)[source]

Function to search for keywords in a df.

This function uses fuzzy matching to find the best matches for the keywords. It uses the extracted content from the rplan pdfs as input. The keywords are stored in a json file. It basically uses the search_df_for_best_matches_keyword_dict function from the contextual_fuzzy_search module.

Parameters:

input_df – Input df to be searched for keywords
save_path – defaults to None; if None, the result is not saved
drop_false_rows – defaults to False; if True, rows with all False values are dropped

Returns:

Result df of the keyword search, together with rplan_keywords.keys(), the list of keywords used for the search

Return type:

pd.DataFrame

negate_keyword_search(input_df: DataFrame, negate_keyword_dict_path: str, keyword_column: str = 'section')[source]

Function to negate the result of the keyword search.

This function removes rows from the input df if the negate keywords are found in the text. It is a simple exact matching search.

Parameters:

input_df – Input df to be searched for keywords
keyword_column – Name of the column in the input df holding the relevant text
negate_keyword_dict_path – Path to the negate keyword dict

Returns:

Result df of the keyword search with additional columns from the input df

Return type:

pd.DataFrame
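Removing rows that contain a negating keyword can be sketched as follows. The example sections and the negate keyword list are hypothetical; in the function above, the keywords are loaded from the file at negate_keyword_dict_path, whose structure is not documented here.

```python
import pandas as pd

input_df = pd.DataFrame(
    {
        "section": [
            "Wohnbauflächen sind zulässig.",
            "Windenergieanlagen sind nicht zulässig.",
            "Gewerbeflächen sind vorgesehen.",
        ]
    }
)
negate_keywords = ["nicht zulässig"]  # assumed content of the negate keyword file

# Mark every row whose text contains at least one negate keyword.
negated = pd.Series(False, index=input_df.index)
for keyword in negate_keywords:
    negated |= input_df["section"].str.contains(keyword, case=False, regex=False)

# Keep only the rows without a negating phrase.
filtered_df = input_df[~negated]
print(filtered_df)
```

Passing regex=False treats each keyword as a literal string, so punctuation in a keyword is not interpreted as a pattern.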

plot_keyword_search_results(result_df, keyword_columns: list, title: str = 'Keyword Search Results')[source]

Function to plot the results of a keyword search

Parameters:

result_df – Result df of the keyword search
keyword_columns – List of columns containing the keywords
title – Title of the plot