Feature Extraction

This section documents the feature extraction API: keyword extraction (fuzzy or exact) and document categorization.

Document Categorization

This function performs a binary categorization of documents into BP and non-BP documents. It uses the BP keywords and the BP document structure to detect BP documents.

run_bp_keyword_detector(text_file_path: str, original_files_path: str, text_column: str = 'content', id_column: str = 'filename', sample_n: int = None) → DataFrame

Run the BP and keyword detector.

Checks whether the text contains the BP pattern and whether the keywords are found. Also adds the categorization from the original dataframe.

Parameters:
  • text_file_path (str) – path to the text file

  • original_files_path (str) – path to the original files

  • text_column (str) – name of the column with the text

  • id_column (str) – name of the column with the unique file identifier

  • sample_n (int) – number of rows to sample

Returns:

Dataframe with the columns filename and document_category and the unique document id.

Return type:

pd.DataFrame
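The combination of a pattern check and a keyword check described above can be sketched roughly as follows. This is not the implementation of run_bp_keyword_detector: the pattern, the keyword list, the rule that both checks must succeed, and the labels 'BP' and 'non-BP' are all assumptions made for illustration, and plain dictionaries stand in for the dataframe.

```python
import re

# Hypothetical values: the source does not state the actual pattern or keywords.
BP_PATTERN = re.compile(r"§\s*\d+")
BP_KEYWORDS = ["Grundflächenzahl"]


def categorize(text: str) -> str:
    """Label a text 'BP' if it matches the pattern and contains a keyword.

    Requiring both checks is an assumption, not the documented rule.
    """
    has_pattern = BP_PATTERN.search(text) is not None
    has_keyword = any(k.lower() in text.lower() for k in BP_KEYWORDS)
    return "BP" if has_pattern and has_keyword else "non-BP"


# Made-up example documents keyed by the default id and text column names.
documents = [
    {"filename": "a.txt", "content": "§ 1 Die Grundflächenzahl beträgt 0,5"},
    {"filename": "b.txt", "content": "Einladung zur Sitzung"},
]
results = [
    {"filename": d["filename"], "document_category": categorize(d["content"])}
    for d in documents
]
```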

Textual Feature Extraction

This section documents the textual feature extraction API, that is, the keyword extraction (fuzzy or exact). The fuzzy keyword extraction is based on the Levenshtein distance; the exact keyword extraction is based on regular-expression or substring matching.
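To illustrate the difference between the two matching modes, here is a minimal sketch in plain Python. The function names, the tokenization into words, and the max_distance threshold are assumptions for illustration and are not part of the documented API.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein (edit) distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def exact_match(keyword: str, text: str) -> bool:
    """Exact search: case-insensitive regular-expression match of the keyword."""
    return re.search(re.escape(keyword), text, flags=re.IGNORECASE) is not None


def fuzzy_match(keyword: str, text: str, max_distance: int = 2) -> bool:
    """Fuzzy search: True if any word in the text is within max_distance edits."""
    tokens = re.findall(r"\w+", text.lower())
    return any(levenshtein(keyword.lower(), t) <= max_distance for t in tokens)
```

With these helpers, a misspelled keyword such as "Grundflachenzahl" is missed by the exact search but still found by the fuzzy search, because it is one edit away from "Grundflächenzahl".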

enrich_extracts_with_metadata(info_df: DataFrame, text_df: DataFrame)

Joins the BP metadata and the BP text to produce the document_texts table.

Parameters:
  • info_df – dataframe containing the metadata

  • text_df – dataframe containing the extracted text

Returns:

the merged dataframe

Return type:

pd.DataFrame
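A join of this kind can be sketched with pandas as shown here. The join key (filename), the join type (left), and the example columns are assumptions; the source does not state how enrich_extracts_with_metadata matches the rows.

```python
import pandas as pd

# Hypothetical metadata and text tables; the column names are assumptions.
info_df = pd.DataFrame(
    {"filename": ["a.pdf", "b.pdf"], "document_category": ["BP", "non-BP"]}
)
text_df = pd.DataFrame(
    {
        "filename": ["a.pdf", "b.pdf", "c.pdf"],
        "content": ["Die Grundflächenzahl beträgt 0,5", "Einladung", "Protokoll"],
    }
)

# A left join keeps every extracted text, even when no metadata row exists.
document_texts = text_df.merge(info_df, on="filename", how="left")
```

Rows without matching metadata, such as c.pdf here, end up with missing values in the metadata columns, which makes gaps in the metadata easy to spot.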

Agent

This section documents the agent extraction API. The agent extraction uses a large language model (GPT-3.5) to extract detailed information from the results of a fuzzy search. One example is the extraction of the “Grundflächenzahl” from the sentence “Die Grundflächenzahl beträgt 0,5”.

Warning

The agent extraction is not fully evaluated. Only simple backchecking was done, e.g. checking whether the resulting number actually appears in the sentence. Because the extraction is based on a large language model, the results are not always correct. Check the results manually before using them.
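The kind of backcheck mentioned in the warning can be sketched as follows. This is an illustration only: the function name and the handling of German decimal commas are assumptions, not the validation that the library actually performs.

```python
import re


def _normalise(number: str) -> str:
    """Treat the German decimal comma and the decimal point as equivalent."""
    return number.strip().replace(",", ".")


def number_appears_in_sentence(extracted_value: str, sentence: str) -> bool:
    """Backcheck: True if the extracted number occurs literally in the sentence."""
    numbers_in_sentence = re.findall(r"\d+(?:[.,]\d+)?", sentence)
    return _normalise(extracted_value) in {_normalise(n) for n in numbers_in_sentence}


sentence = "Die Grundflächenzahl beträgt 0,5"
is_valid = number_appears_in_sentence("0,5", sentence)
is_hallucinated = not number_appears_in_sentence("0,8", sentence)
```

A check like this only confirms that the number was present in the input; it cannot tell whether the number belongs to the requested keyword, which is why manual review remains advisable.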

extract_knowledge_from_df(keyword_dict: dict, input_df: DataFrame, id_column_name: str, text_column_name: str, model_name: str = 'gpt-3.5-turbo-0613') → DataFrame

Extracts the relevant value from the text input, if present, and validates it.

This function extracts relevant information from a dataframe of text snippets. It takes a dataframe and a dictionary of keywords as input and returns a dataframe with the extracted information and a validation column.

Parameters:
  • keyword_dict – dictionary containing keyword, keyword_short and template_name

  • input_df – dataframe of text snippets to extract from

  • id_column_name – name of the identifying column (e.g., filename)

  • text_column_name – name of the column holding the relevant text

  • model_name – name of the LLM used (for the OpenAI API call)

Returns:

dataframe containing input_text, extracted_value and validation

Return type:

pd.DataFrame