Synthetic (Numpy) Code Generator

This is a combination of proof-of-concept scripts in which I set out to generate synthetic data that could potentially be used to train a machine learning model. Code could be a good source of such data because the success of the generation can be verified and measured with objective performance metrics (like runtime or memory use). Here's how the project works:

Read a Python library (I chose numpy) and extract its functions and their code. (generate_inputs.py)
Use OpenAI's chat completion API to read that code and generate some potential input to the function. (generate_inputs.py)
Run the function on those inputs inside a Docker container and capture the output (including any errors it raises). (generate_inputs.py)
Call the OpenAI API again to generate a "better version" of the original function code and save that. (generate_rewrites.py and clean_rewrites.py)
Execute the new code with the previously generated inputs to ensure that the outputs are the same as the original code. (test_rewrites.py)
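
The first step above can be sketched with the standard library's `inspect` module. This is a minimal illustration, not the code in generate_inputs.py: `extract_functions` is a hypothetical helper, and `textwrap` stands in for numpy so the snippet runs without extra packages. Real numpy functions are often wrapped by dispatch decorators, so they may need extra handling.

```python
import inspect
import textwrap  # stand-in for numpy so the sketch runs without extra packages


def extract_functions(module):
    """Map the name of each function defined in `module` to its source code."""
    sources = {}
    for name, fn in inspect.getmembers(module, inspect.isfunction):
        # Skip functions that were only imported into the module.
        if fn.__module__ != module.__name__:
            continue
        try:
            sources[name] = inspect.getsource(fn)
        except (OSError, TypeError):
            # Source is unavailable (for example, compiled extensions).
            continue
    return sources


functions = extract_functions(textwrap)
print(sorted(functions))
```

Each extracted source string could then be placed into a prompt for the input-generation step.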

Results

Generating synthetic code this way is certainly possible! Here's a screenshot of the final tests of some rewritten code:

Here's an example of some artificially rewritten code:

    def array_equiv(a1, a2):
        """
        Returns True if input arrays are shape consistent and all elements equal.

        Shape consistent means they are either the same shape, or one input array
        can be broadcasted to create the same shape as the other one.

        Parameters
        ----------
        a1, a2 : array_like
            Input arrays.

        Returns
        -------
        out : bool
            True if equivalent, False otherwise.

        Examples
        --------
        >>> np.array_equiv([1, 2], [1, 2])
        True
        >>> np.array_equiv([1, 2], [1, 3])
        False

        Showing the shape equivalence:

        >>> np.array_equiv([1, 2], [[1, 2], [1, 2]])
        True
        >>> np.array_equiv([1, 2], [[1, 2, 1, 2], [1, 2, 1, 2]])
        False

        >>> np.array_equiv([1, 2], [[1, 2], [1, 3]])
        False

        """
        import numpy as np  # import numpy

        # Try to convert the inputs to numpy arrays
        try:
            input_array1, input_array2 = np.asarray(a1), np.asarray(a2)
        except ValueError:  # Specifically handle ValueError exceptions
            return False

        # Check if the arrays are shape consistent
        if input_array1.shape != input_array2.shape:  # Compare the shapes directly
            return False

        # Check if all elements are equal
        return np.array_equal(input_array1, input_array2)  # Use numpy's array_equal instead of == operator
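
A check that a rewrite behaves like the original could look roughly like the sketch below. It is an assumption-laden illustration, not the code in test_rewrites.py: `capture`, `find_mismatches`, and the two toy `mean` functions are hypothetical names, and everything runs in-process rather than in Docker.

```python
def capture(fn, args):
    """Run fn(*args) and return ("ok", repr(result)) or ("error", exception name)."""
    try:
        return ("ok", repr(fn(*args)))
    except Exception as exc:
        return ("error", type(exc).__name__)


def find_mismatches(original, rewrite, inputs):
    """Return the inputs on which the two functions behave differently."""
    mismatches = []
    for args in inputs:
        expected = capture(original, args)
        actual = capture(rewrite, args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


# Toy stand-ins for an original function and its generated rewrite.
def original_mean(values):
    return sum(values) / len(values)


def rewritten_mean(values):
    if not values:
        return 0.0  # behaviour change: the original raises ZeroDivisionError
    return sum(values) / len(values)


inputs = [([1, 2, 3],), ([],)]
mismatches = find_mismatches(original_mean, rewritten_mean, inputs)
for args, expected, actual in mismatches:
    print(f"input={args!r} original={expected} rewrite={actual}")
```

Inputs that exercise edge cases matter here. For instance, the rewrite above compares shapes directly, so a broadcastable pair such as `[1, 2]` and `[[1, 2], [1, 2]]` from the docstring would return False rather than True, which is the kind of difference such a check is meant to surface.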

Next Steps

Some cool things could be done in a future version of these scripts:

Use the numpy documentation to generate more realistic inputs to the functions
Build the Dockerfile to accept more than just numpy as an input package
Test different code generation prompts to make more invasive changes to each function
Benchmark the runtime of the original code and the rewritten code to see if there's a performance or memory improvement
Detect unit tests and use those to evaluate the success of the artificial code as well
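
The benchmarking idea could be prototyped with the standard library's `timeit` and `tracemalloc` modules. The sketch below is only one possible approach: `benchmark` and the two `sum_squares` functions are hypothetical stand-ins for an original function and its rewrite, and `tracemalloc` only sees allocations made through Python's memory allocator.

```python
import timeit
import tracemalloc


def benchmark(fn, args, repeat=5, number=200):
    """Return (best seconds per call, peak bytes traced during one call)."""
    timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    best = min(timings) / number

    tracemalloc.start()
    fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak


# Toy stand-ins for an original function and its generated rewrite.
def original_sum_squares(values):
    squares = [v * v for v in values]  # builds an intermediate list
    return sum(squares)


def rewritten_sum_squares(values):
    return sum(v * v for v in values)  # streams values through a generator


data = list(range(1000))
results = {}
for fn in (original_sum_squares, rewritten_sum_squares):
    results[fn.__name__] = benchmark(fn, (data,))
    seconds, peak = results[fn.__name__]
    print(f"{fn.__name__}: {seconds:.2e} s per call, peak {peak} bytes")
```

Timings vary between machines and runs, so a comparison like this is best repeated several times before drawing conclusions.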

Setup

Install Python 3.8
Install Docker
Install poetry

Running

poetry run python [script_name].py
